The term malware is a contraction of malicious software. Put simply, malware is any piece of software that was written with the intent of doing harm to data, devices or to people.
Source: https://www.avg.com/en/signal/what-is-malware
In the past few years, the malware industry has grown so rapidly that its syndicates invest heavily in technologies to evade traditional protection, forcing anti-malware groups and communities to build more robust software to detect and stop these attacks. A major part of protecting a computer system from a malware attack is identifying whether a given file or piece of software is malware.
Microsoft has been very active in building anti-malware products over the years, and it runs its anti-malware utilities on over 150 million computers around the world. This generates tens of millions of daily data points to be analyzed as potential malware. To analyze and classify such large amounts of data effectively, we need to be able to sort the samples into groups and identify their respective families.
The dataset provided by Microsoft contains nine classes of malware.
Source: https://www.kaggle.com/c/malware-classification
.asm file
.text:00401000 assume es:nothing, ss:nothing, ds:_data, fs:nothing, gs:nothing
.text:00401000 56 push esi
.text:00401001 8D 44 24 08 lea eax, [esp+8]
.text:00401005 50 push eax
.text:00401006 8B F1 mov esi, ecx
.text:00401008 E8 1C 1B 00 00 call ??0exception@std@@QAE@ABQBD@Z ; std::exception::exception(char const * const &)
.text:0040100D C7 06 08 BB 42 00 mov dword ptr [esi], offset off_42BB08
.text:00401013 8B C6 mov eax, esi
.text:00401015 5E pop esi
.text:00401016 C2 04 00 retn 4
.text:00401016 ; ---------------------------------------------------------------------------
.text:00401019 CC CC CC CC CC CC CC align 10h
.text:00401020 C7 01 08 BB 42 00 mov dword ptr [ecx], offset off_42BB08
.text:00401026 E9 26 1C 00 00 jmp sub_402C51
.text:00401026 ; ---------------------------------------------------------------------------
.text:0040102B CC CC CC CC CC align 10h
.text:00401030 56 push esi
.text:00401031 8B F1 mov esi, ecx
.text:00401033 C7 06 08 BB 42 00 mov dword ptr [esi], offset off_42BB08
.text:00401039 E8 13 1C 00 00 call sub_402C51
.text:0040103E F6 44 24 08 01 test byte ptr [esp+8], 1
.text:00401043 74 09 jz short loc_40104E
.text:00401045 56 push esi
.text:00401046 E8 6C 1E 00 00 call ??3@YAXPAX@Z ; operator delete(void *)
.text:0040104B 83 C4 04 add esp, 4
.text:0040104E
.text:0040104E loc_40104E: ; CODE XREF: .text:00401043j
.text:0040104E 8B C6 mov eax, esi
.text:00401050 5E pop esi
.text:00401051 C2 04 00 retn 4
.text:00401051 ; ---------------------------------------------------------------------------
.bytes file
00401000 00 00 80 40 40 28 00 1C 02 42 00 C4 00 20 04 20
00401010 00 00 20 09 2A 02 00 00 00 00 8E 10 41 0A 21 01
00401020 40 00 02 01 00 90 21 00 32 40 00 1C 01 40 C8 18
00401030 40 82 02 63 20 00 00 09 10 01 02 21 00 82 00 04
00401040 82 20 08 83 00 08 00 00 00 00 02 00 60 80 10 80
00401050 18 00 00 20 A9 00 00 00 00 04 04 78 01 02 70 90
00401060 00 02 00 08 20 12 00 00 00 40 10 00 80 00 40 19
00401070 00 00 00 00 11 20 80 04 80 10 00 20 00 00 25 00
00401080 00 00 01 00 00 04 00 10 02 C1 80 80 00 20 20 00
00401090 08 A0 01 01 44 28 00 00 08 10 20 00 02 08 00 00
004010A0 00 40 00 00 00 34 40 40 00 04 00 08 80 08 00 08
004010B0 10 00 40 00 68 02 40 04 E1 00 28 14 00 08 20 0A
004010C0 06 01 02 00 40 00 00 00 00 00 00 20 00 02 00 04
004010D0 80 18 90 00 00 10 A0 00 45 09 00 10 04 40 44 82
004010E0 90 00 26 10 00 00 04 00 82 00 00 00 20 40 00 00
004010F0 B4 00 00 40 00 02 20 25 08 00 00 00 00 00 00 00
00401100 08 00 00 50 00 08 40 50 00 02 06 22 08 85 30 00
00401110 00 80 00 80 60 00 09 00 04 20 00 00 00 00 00 00
00401120 00 82 40 02 00 11 46 01 4A 01 8C 01 E6 00 86 10
00401130 4C 01 22 00 64 00 AE 01 EA 01 2A 11 E8 10 26 11
00401140 4E 11 8E 11 C2 00 6C 00 0C 11 60 01 CA 00 62 10
00401150 6C 01 A0 11 CE 10 2C 11 4E 10 8C 00 CE 01 AE 01
00401160 6C 10 6C 11 A2 01 AE 00 46 11 EE 10 22 00 A8 00
00401170 EC 01 08 11 A2 01 AE 10 6C 00 6E 00 AC 11 8C 00
00401180 EC 01 2A 10 2A 01 AE 00 40 00 C8 10 48 01 4E 11
00401190 0E 00 EC 11 24 10 4A 10 04 01 C8 11 E6 01 C2 00
Each data point must be assigned to one of nine malware classes, so this is a multi-class classification problem.
Source: https://www.kaggle.com/c/malware-classification#evaluation
Metric(s):
Objective: Predict the probability of each data-point belonging to each of the nine classes.
Constraints:
Split the dataset randomly into three parts (train, cross validation, and test) with 64%, 16%, and 20% of the data respectively.
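The objective above asks for a probability per class, and the notebook later scores models with `sklearn.metrics.log_loss`. As a rough illustration of what that number measures, here is a minimal pure-Python sketch of multi-class log loss. The function name `multiclass_log_loss` and the toy inputs are made up for this example and are not part of the original notebook; labels are assumed to run from 1 to K.

```python
import math

def multiclass_log_loss(y_true, probs, eps=1e-15):
    """Mean negative log-probability assigned to the true class.

    y_true: class labels in 1..K; probs: one list of K probabilities per sample.
    """
    total = 0.0
    for label, row in zip(y_true, probs):
        # clip the probability of the true class to avoid log(0)
        p = min(max(row[label - 1], eps), 1 - eps)
        total += -math.log(p)
    return total / len(y_true)

# Toy example with 3 classes (hypothetical numbers, for illustration only).
y_true = [1, 3]
probs = [[0.7, 0.2, 0.1],
         [0.1, 0.1, 0.8]]
loss = multiclass_log_loss(y_true, probs)
print(loss)
```

A confident wrong prediction is penalized heavily, which is why the later cells calibrate each classifier before computing the loss.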
http://blog.kaggle.com/2015/05/26/microsoft-malware-winners-interview-1st-place-no-to-overfitting/
https://arxiv.org/pdf/1511.04317.pdf
First place solution in Kaggle competition: https://www.youtube.com/watch?v=VLQTRlLGz5Y
https://github.com/dchad/malware-detection
http://vizsec.org/files/2011/Nataraj.pdf
https://www.dropbox.com/sh/gfqzv0ckgs4l1bf/AAB6EelnEjvvuQg2nu_pIB6ua?dl=0
" Cross validation is more trustworthy than domain knowledge."
import warnings
warnings.filterwarnings("ignore")
import shutil
import os
import pandas as pd
import matplotlib
matplotlib.use(u'nbAgg')
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import pickle
from sklearn.manifold import TSNE
from sklearn import preprocessing
from multiprocessing import Process  # used to run work in separate processes
import multiprocessing
import codecs  # used for file operations
import random as r
from xgboost import XGBClassifier
from sklearn.model_selection import RandomizedSearchCV
from sklearn.tree import DecisionTreeClassifier
from sklearn.calibration import CalibratedClassifierCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import log_loss
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
if False:
    # separating byte files and asm files
    source = 'train'
    destination_1 = 'byteFiles'
    destination_2 = 'asmFiles'

    # create the 'byteFiles' and 'asmFiles' folders if they do not exist yet
    if not os.path.isdir(destination_1):
        os.makedirs(destination_1)
    if not os.path.isdir(destination_2):
        os.makedirs(destination_2)

    # the 'train' folder contains both .asm files and .bytes files:
    # every file ending with .bytes is moved to 'byteFiles' and
    # every file ending with .asm is moved to 'asmFiles',
    # so by the end of this snippet all the .bytes and .asm files are separated
    if os.path.isdir(source):
        data_files = os.listdir(source)
        for file in data_files:
            print(file)
            if file.endswith("bytes"):
                shutil.move(os.path.join(source, file), destination_1)
            if file.endswith("asm"):
                shutil.move(os.path.join(source, file), destination_2)
Y=pd.read_csv("trainLabels.csv")
total = len(Y)*1.
ax=sns.countplot(x="Class", data=Y)
for p in ax.patches:
    ax.annotate('{:.1f}%'.format(100*p.get_height()/total), (p.get_x()+0.1, p.get_height()+5))
#put 11 ticks (therefore 10 steps), from 0 to the total number of rows in the dataframe
ax.yaxis.set_ticks(np.linspace(0, total, 11))
#adjust the ticklabel to the desired format, without changing the position of the ticks.
ax.set_yticklabels(['{:.1f}%'.format(v) for v in 100*ax.yaxis.get_majorticklocs()/total])
plt.show()
if not os.path.exists('data_size_byte.csv'):
    #file sizes of byte files
    files=os.listdir('byteFiles')
    filenames=Y['Id'].tolist()
    class_y=Y['Class'].tolist()
    class_bytes=[]
    sizebytes=[]
    fnames=[]
    for file in files:
        # print(os.stat('byteFiles/0A32eTdBKayjCWhZqDOQ.txt'))
        # os.stat_result(st_mode=33206, st_ino=1125899906874507, st_dev=3561571700, st_nlink=1, st_uid=0, st_gid=0,
        # st_size=3680109, st_atime=1519638522, st_mtime=1519638522, st_ctime=1519638522)
        # read more about os.stat: here https://www.tutorialspoint.com/python/os_stat.htm
        statinfo=os.stat('byteFiles/'+file)
        # split the file name at '.' and take the first part of it i.e the file name
        file=file.split('.')[0]
        if any(file == filename for filename in filenames):
            i=filenames.index(file)
            class_bytes.append(class_y[i])
            # converting into MB
            sizebytes.append(statinfo.st_size/(1024.0*1024.0))
            fnames.append(file)
    data_size_byte=pd.DataFrame({'ID':fnames,'size':sizebytes,'Class':class_bytes})
    data_size_byte.to_csv('data_size_byte.csv', index = False)
else:
    data_size_byte = pd.read_csv('data_size_byte.csv')
print (data_size_byte.head())
                     ID      size  Class
0  01azqd4InC7m9JpocGv5  5.012695      9
1  01IsoiSMh5gxyDYTl4CB  6.556152      2
2  01jsnpXSAlgw6aPeDxrU  4.602051      9
3  01kcPWA9K2BOxQeS5Rju  0.679688      1
4  01SuzwMJEIXsK7A8dQbl  0.438965      8
#boxplot of byte files
ax = sns.boxplot(x="Class", y="size", data=data_size_byte)
plt.title("boxplot of .bytes file sizes")
plt.show()
if False:
    # removal of address from byte files
    # contents of .byte files
    # ----------------
    #00401000 56 8D 44 24 08 50 8B F1 E8 1C 1B 00 00 C7 06 08
    #-------------------
    #we remove the starting address 00401000
    files = os.listdir('byteFiles')
    filenames=[]
    array=[]
    for file in files:
        if(file.endswith("bytes")):
            file=file.split('.')[0]
            # write the address-free lines to <name>.txt, then delete the original .bytes file
            with open('byteFiles/'+file+".txt", 'w+') as text_file:
                with open('byteFiles/'+file+".bytes","r") as fp:
                    for line in fp:
                        # drop the first token (the address) and keep the byte values
                        a=line.rstrip().split(" ")[1:]
                        b=' '.join(a)
                        b=b+"\n"
                        text_file.write(b)
            os.remove('byteFiles/'+file+".bytes")
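The address-stripping step above can be checked on a single line without touching the dataset. This is a small sketch only; `strip_address` is a hypothetical helper name, and the sample line is the one quoted in the comment of the cell above.

```python
def strip_address(line):
    """Drop the leading address token from one line of a .bytes hex dump."""
    tokens = line.rstrip().split(" ")
    return " ".join(tokens[1:])

sample = "00401000 56 8D 44 24 08 50 8B F1 E8 1C 1B 00 00 C7 06 08\n"
stripped = strip_address(sample)
print(stripped)
```

Keeping the transformation in a small function makes it easy to test before running the destructive `os.remove` step on the real files.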
files = os.listdir('byteFiles')
filenames2=[]
feature_matrix = np.zeros((len(files),257),dtype=int)
k=0
#program to convert into bag of words of bytefiles
#this is custom-built bag of words this is unigram bag of words
byte_feature_file=open('result.csv','w+')
byte_feature_file.write("ID,0,1,2,3,4,5,6,7,8,9,0a,0b,0c,0d,0e,0f,10,11,12,13,14,15,16,17,18,19,1a,1b,1c,1d,1e,1f,20,21,22,23,24,25,26,27,28,29,2a,2b,2c,2d,2e,2f,30,31,32,33,34,35,36,37,38,39,3a,3b,3c,3d,3e,3f,40,41,42,43,44,45,46,47,48,49,4a,4b,4c,4d,4e,4f,50,51,52,53,54,55,56,57,58,59,5a,5b,5c,5d,5e,5f,60,61,62,63,64,65,66,67,68,69,6a,6b,6c,6d,6e,6f,70,71,72,73,74,75,76,77,78,79,7a,7b,7c,7d,7e,7f,80,81,82,83,84,85,86,87,88,89,8a,8b,8c,8d,8e,8f,90,91,92,93,94,95,96,97,98,99,9a,9b,9c,9d,9e,9f,a0,a1,a2,a3,a4,a5,a6,a7,a8,a9,aa,ab,ac,ad,ae,af,b0,b1,b2,b3,b4,b5,b6,b7,b8,b9,ba,bb,bc,bd,be,bf,c0,c1,c2,c3,c4,c5,c6,c7,c8,c9,ca,cb,cc,cd,ce,cf,d0,d1,d2,d3,d4,d5,d6,d7,d8,d9,da,db,dc,dd,de,df,e0,e1,e2,e3,e4,e5,e6,e7,e8,e9,ea,eb,ec,ed,ee,ef,f0,f1,f2,f3,f4,f5,f6,f7,f8,f9,fa,fb,fc,fd,fe,ff,??")
byte_feature_file.write("\n")
for file in files:
    filenames2.append(file)
    byte_feature_file.write(file+",")
    if(file.endswith("txt")):
        with open('byteFiles/'+file,"r") as byte_file:
            for lines in byte_file:
                line=lines.rstrip().split(" ")
                for hex_code in line:
                    # '??' is counted in the extra 257th column, every other token is a hex byte value
                    if hex_code=='??':
                        feature_matrix[k][256]+=1
                    else:
                        feature_matrix[k][int(hex_code,16)]+=1
    # write the 257 counts for this file as one CSV row
    for i, row in enumerate(feature_matrix[k]):
        if i!=len(feature_matrix[k])-1:
            byte_feature_file.write(str(row)+",")
        else:
            byte_feature_file.write(str(row))
    byte_feature_file.write("\n")
    k += 1
byte_feature_file.close()
byte_features=pd.read_csv("result.csv")
byte_features['ID'] = byte_features['ID'].str.split('.').str[0]
byte_features.head(2)
| | ID | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | ... | f7 | f8 | f9 | fa | fb | fc | fd | fe | ff | ?? |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 01azqd4InC7m9JpocGv5 | 601905 | 3905 | 2816 | 3832 | 3345 | 3242 | 3650 | 3201 | 2965 | ... | 2804 | 3687 | 3101 | 3211 | 3097 | 2758 | 3099 | 2759 | 5753 | 1824 |
| 1 | 01IsoiSMh5gxyDYTl4CB | 39755 | 8337 | 7249 | 7186 | 8663 | 6844 | 8420 | 7589 | 9291 | ... | 451 | 6536 | 439 | 281 | 302 | 7639 | 518 | 17001 | 54902 | 8588 |
2 rows × 258 columns
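The unigram bag of words built above is just a count of how often each byte value (plus the `??` placeholder) occurs in a file. A compact sketch of the same idea with `collections.Counter` follows; `byte_unigram_counts` is an illustrative name, and the two input lines are toy data rather than a real sample.

```python
from collections import Counter

def byte_unigram_counts(lines):
    """Return a 257-long count vector: indices 0..255 for byte values, 256 for '??'."""
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    vector = [0] * 257
    for token, n in counts.items():
        index = 256 if token == "??" else int(token, 16)
        vector[index] += n
    return vector

# Two toy lines of address-free hex bytes (not taken from the dataset).
vec = byte_unigram_counts(["56 8D 44 24", "00 00 ?? 56"])
print(vec[0x56], vec[0x00], vec[256])
```

Using `split()` without an argument also skips empty tokens, so blank lines do not raise a `ValueError` in `int(token, 16)`.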
data_size_byte.head(2)
| | ID | size | Class |
|---|---|---|---|
| 0 | 01azqd4InC7m9JpocGv5 | 5.012695 | 9 |
| 1 | 01IsoiSMh5gxyDYTl4CB | 6.556152 | 2 |
byte_features_with_size = byte_features.merge(data_size_byte, on='ID')
byte_features_with_size.to_csv("result_with_size.csv")
byte_features_with_size.head(2)
| | ID | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | ... | f9 | fa | fb | fc | fd | fe | ff | ?? | size | Class |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 01azqd4InC7m9JpocGv5 | 601905 | 3905 | 2816 | 3832 | 3345 | 3242 | 3650 | 3201 | 2965 | ... | 3101 | 3211 | 3097 | 2758 | 3099 | 2759 | 5753 | 1824 | 5.012695 | 9 |
| 1 | 01IsoiSMh5gxyDYTl4CB | 39755 | 8337 | 7249 | 7186 | 8663 | 6844 | 8420 | 7589 | 9291 | ... | 439 | 281 | 302 | 7639 | 518 | 17001 | 54902 | 8588 | 6.556152 | 2 |
2 rows × 260 columns
# https://stackoverflow.com/a/29651514
def normalize(df):
    # min-max scale every column except 'ID' and 'Class' to the range [0, 1]
    result1 = df.copy()
    for feature_name in df.columns:
        if (str(feature_name) != str('ID') and str(feature_name)!=str('Class')):
            max_value = df[feature_name].max()
            min_value = df[feature_name].min()
            result1[feature_name] = (df[feature_name] - min_value) / (max_value - min_value)
    return result1
result = normalize(byte_features_with_size)
result.head(2)
| | ID | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | ... | f9 | fa | fb | fc | fd | fe | ff | ?? | size | Class |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 01azqd4InC7m9JpocGv5 | 0.262806 | 0.005498 | 0.001567 | 0.002067 | 0.002048 | 0.001835 | 0.002058 | 0.002946 | 0.002638 | ... | 0.01356 | 0.013107 | 0.013634 | 0.031724 | 0.014549 | 0.014348 | 0.007843 | 0.000129 | 0.092219 | 9 |
| 1 | 01IsoiSMh5gxyDYTl4CB | 0.017358 | 0.011737 | 0.004033 | 0.003876 | 0.005303 | 0.003873 | 0.004747 | 0.006984 | 0.008267 | ... | 0.00192 | 0.001147 | 0.001329 | 0.087867 | 0.002432 | 0.088411 | 0.074851 | 0.000606 | 0.121237 | 2 |
2 rows × 260 columns
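The `normalize` function divides by `max - min`, which is zero for a column whose values are all identical and would then produce NaN. Below is a hedged sketch of min-max scaling for a single column that guards against that case. The helper name `min_max_scale` and the choice to map a constant column to zeros are assumptions made for this example, not something the notebook does.

```python
import numpy as np

def min_max_scale(values):
    """Scale a 1-D sequence to [0, 1]; a constant column is mapped to all zeros."""
    values = np.asarray(values, dtype=float)
    lo, hi = values.min(), values.max()
    if hi == lo:
        # assumption: a constant feature carries no information, so return zeros
        return np.zeros_like(values)
    return (values - lo) / (hi - lo)

scaled = min_max_scale([2.0, 4.0, 10.0])
constant = min_max_scale([5.0, 5.0, 5.0])
print(scaled, constant)
```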
data_y = result['Class']
result.head()
| | ID | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | ... | f9 | fa | fb | fc | fd | fe | ff | ?? | size | Class |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 01azqd4InC7m9JpocGv5 | 0.262806 | 0.005498 | 0.001567 | 0.002067 | 0.002048 | 0.001835 | 0.002058 | 0.002946 | 0.002638 | ... | 0.013560 | 0.013107 | 0.013634 | 0.031724 | 0.014549 | 0.014348 | 0.007843 | 0.000129 | 0.092219 | 9 |
| 1 | 01IsoiSMh5gxyDYTl4CB | 0.017358 | 0.011737 | 0.004033 | 0.003876 | 0.005303 | 0.003873 | 0.004747 | 0.006984 | 0.008267 | ... | 0.001920 | 0.001147 | 0.001329 | 0.087867 | 0.002432 | 0.088411 | 0.074851 | 0.000606 | 0.121237 | 2 |
| 2 | 01jsnpXSAlgw6aPeDxrU | 0.040827 | 0.013434 | 0.001429 | 0.001315 | 0.005464 | 0.005280 | 0.005078 | 0.002155 | 0.008104 | ... | 0.009804 | 0.011777 | 0.012604 | 0.028423 | 0.013080 | 0.013937 | 0.067001 | 0.000033 | 0.084499 | 9 |
| 3 | 01kcPWA9K2BOxQeS5Rju | 0.009209 | 0.001708 | 0.000404 | 0.000441 | 0.000770 | 0.000354 | 0.000310 | 0.000481 | 0.000959 | ... | 0.002121 | 0.001886 | 0.002272 | 0.013032 | 0.002211 | 0.003957 | 0.010904 | 0.000984 | 0.010759 | 1 |
| 4 | 01SuzwMJEIXsK7A8dQbl | 0.008629 | 0.001000 | 0.000168 | 0.000234 | 0.000342 | 0.000232 | 0.000148 | 0.000229 | 0.000376 | ... | 0.001530 | 0.000853 | 0.001052 | 0.007511 | 0.001038 | 0.001258 | 0.002998 | 0.000636 | 0.006233 | 8 |
5 rows × 260 columns
#multivariate analysis on byte files
#this is with perplexity 50
xtsne=TSNE(perplexity=50)
results=xtsne.fit_transform(result.drop(['ID','Class'], axis=1))
vis_x = results[:, 0]
vis_y = results[:, 1]
plt.scatter(vis_x, vis_y, c=data_y, cmap=plt.cm.get_cmap("jet", 9))
plt.colorbar(ticks=range(10))
plt.clim(0.5, 9)
plt.show()
#this is with perplexity 30
xtsne=TSNE(perplexity=30)
results=xtsne.fit_transform(result.drop(['ID','Class'], axis=1))
vis_x = results[:, 0]
vis_y = results[:, 1]
plt.scatter(vis_x, vis_y, c=data_y, cmap=plt.cm.get_cmap("jet", 9))
plt.colorbar(ticks=range(10))
plt.clim(0.5, 9)
plt.show()
data_y = result['Class']
# split the data into train and test sets, keeping the class distribution of 'data_y' the same in both [stratify=data_y]
X_train, X_test, y_train, y_test = train_test_split(result.drop(['ID','Class'], axis=1), data_y,stratify=data_y,test_size=0.20)
# split the train data into train and cross validation sets, keeping the class distribution of 'y_train' the same in both [stratify=y_train]
X_train, X_cv, y_train, y_cv = train_test_split(X_train, y_train,stratify=y_train,test_size=0.20)
print('Number of data points in train data:', X_train.shape[0])
print('Number of data points in test data:', X_test.shape[0])
print('Number of data points in cross validation data:', X_cv.shape[0])
Number of data points in train data: 6955
Number of data points in test data: 2174
Number of data points in cross validation data: 1739
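The two nested 20% splits give the 64% / 16% / 20% proportions stated in the constraints (0.8 × 0.8 = 0.64 and 0.8 × 0.2 = 0.16). The arithmetic can be sketched as below, assuming the held-out size of each split is rounded up, which is consistent with the counts printed above. `split_sizes` is a hypothetical helper and does not call scikit-learn.

```python
import math

def split_sizes(n_samples, test_fraction=0.20, cv_fraction=0.20):
    """Sizes of train / cross validation / test after two nested splits."""
    n_test = math.ceil(test_fraction * n_samples)   # first split: hold out the test set
    n_remaining = n_samples - n_test
    n_cv = math.ceil(cv_fraction * n_remaining)     # second split: hold out the CV set
    n_train = n_remaining - n_cv
    return n_train, n_cv, n_test

# Total number of points = sum of the three counts printed above.
n_total = 6955 + 2174 + 1739
n_train, n_cv, n_test = split_sizes(n_total)
print(n_train, n_cv, n_test)
print(round(n_train / n_total, 3), round(n_cv / n_total, 3), round(n_test / n_total, 3))
```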
# value_counts() returns a Series indexed by class label; sort_values() orders it by count (ascending)
train_class_distribution = y_train.value_counts().sort_values()
test_class_distribution = y_test.value_counts().sort_values()
cv_class_distribution = y_cv.value_counts().sort_values()
my_colors = ['r','g','b','k','y','m','c']
train_class_distribution.plot(kind='bar', color=my_colors)
plt.xlabel('Class')
plt.ylabel('Data points per Class')
plt.title('Distribution of yi in train data')
plt.grid()
plt.show()
# ref: argsort https://docs.scipy.org/doc/numpy/reference/generated/numpy.argsort.html
# -(train_class_distribution.values): the minus sign will give us in decreasing order
# note: i+1 below is the position in the count-sorted Series, not the original class label
sorted_yi = np.argsort(-train_class_distribution.values)
for i in sorted_yi:
    print('Number of data points in class', i+1, ':',train_class_distribution.values[i], '(', np.round((train_class_distribution.values[i]/y_train.shape[0]*100), 3), '%)')
print('-'*80)
my_colors = ['r','g','b','k','y','m','c']
test_class_distribution.plot(kind='bar', color=my_colors)
plt.xlabel('Class')
plt.ylabel('Data points per Class')
plt.title('Distribution of yi in test data')
plt.grid()
plt.show()
# ref: argsort https://docs.scipy.org/doc/numpy/reference/generated/numpy.argsort.html
# -(test_class_distribution.values): the minus sign will give us in decreasing order
sorted_yi = np.argsort(-test_class_distribution.values)
for i in sorted_yi:
    print('Number of data points in class', i+1, ':',test_class_distribution.values[i], '(', np.round((test_class_distribution.values[i]/y_test.shape[0]*100), 3), '%)')
print('-'*80)
my_colors = ['r','g','b','k','y','m','c']
cv_class_distribution.plot(kind='bar', color=my_colors)
plt.xlabel('Class')
plt.ylabel('Data points per Class')
plt.title('Distribution of yi in cross validation data')
plt.grid()
plt.show()
# ref: argsort https://docs.scipy.org/doc/numpy/reference/generated/numpy.argsort.html
# -(cv_class_distribution.values): the minus sign will give us in decreasing order
sorted_yi = np.argsort(-cv_class_distribution.values)
for i in sorted_yi:
    print('Number of data points in class', i+1, ':',cv_class_distribution.values[i], '(', np.round((cv_class_distribution.values[i]/y_cv.shape[0]*100), 3), '%)')
Number of data points in class 9 : 1883 ( 27.074 %)
Number of data points in class 8 : 1586 ( 22.804 %)
Number of data points in class 7 : 986 ( 14.177 %)
Number of data points in class 6 : 786 ( 11.301 %)
Number of data points in class 5 : 648 ( 9.317 %)
Number of data points in class 4 : 481 ( 6.916 %)
Number of data points in class 3 : 304 ( 4.371 %)
Number of data points in class 2 : 254 ( 3.652 %)
Number of data points in class 1 : 27 ( 0.388 %)
--------------------------------------------------------------------------------
Number of data points in class 9 : 588 ( 27.047 %)
Number of data points in class 8 : 496 ( 22.815 %)
Number of data points in class 7 : 308 ( 14.167 %)
Number of data points in class 6 : 246 ( 11.316 %)
Number of data points in class 5 : 203 ( 9.338 %)
Number of data points in class 4 : 150 ( 6.9 %)
Number of data points in class 3 : 95 ( 4.37 %)
Number of data points in class 2 : 80 ( 3.68 %)
Number of data points in class 1 : 8 ( 0.368 %)
--------------------------------------------------------------------------------
Number of data points in class 9 : 471 ( 27.085 %)
Number of data points in class 8 : 396 ( 22.772 %)
Number of data points in class 7 : 247 ( 14.204 %)
Number of data points in class 6 : 196 ( 11.271 %)
Number of data points in class 5 : 162 ( 9.316 %)
Number of data points in class 4 : 120 ( 6.901 %)
Number of data points in class 3 : 76 ( 4.37 %)
Number of data points in class 2 : 64 ( 3.68 %)
Number of data points in class 1 : 7 ( 0.403 %)
def plot_confusion_matrix(test_y, predict_y):
    C = confusion_matrix(test_y, predict_y)
    # the value printed here is the percentage of misclassified points
    print("Number of misclassified points ",(len(test_y)-np.trace(C))/len(test_y)*100)
    # C = 9,9 matrix, each cell (i,j) represents the number of points of class i that are predicted as class j

    A =(((C.T)/(C.sum(axis=1))).T)
    # recall matrix: divide each element of the confusion matrix by the sum of elements in its row

    # C = [[1, 2],
    #      [3, 4]]
    # C.T = [[1, 3],
    #        [2, 4]]
    # C.sum(axis = 1) axis=0 corresponds to columns and axis=1 corresponds to rows in a two dimensional array
    # C.sum(axis =1) = [[3, 7]]
    # ((C.T)/(C.sum(axis=1))) = [[1/3, 3/7]
    #                            [2/3, 4/7]]

    # ((C.T)/(C.sum(axis=1))).T = [[1/3, 2/3]
    #                              [3/7, 4/7]]
    # sum of row elements = 1

    B =(C/C.sum(axis=0))
    # precision matrix: divide each element of the confusion matrix by the sum of elements in its column
    # C = [[1, 2],
    #      [3, 4]]
    # C.sum(axis = 0) axis=0 corresponds to columns and axis=1 corresponds to rows in a two dimensional array
    # C.sum(axis =0) = [[4, 6]]
    # (C/C.sum(axis=0)) = [[1/4, 2/6],
    #                      [3/4, 4/6]]

    labels = [1,2,3,4,5,6,7,8,9]
    cmap=sns.light_palette("green")
    # representing C in heatmap format
    print("-"*50, "Confusion matrix", "-"*50)
    plt.figure(figsize=(10,5))
    sns.heatmap(C, annot=True, cmap=cmap, fmt=".3f", xticklabels=labels, yticklabels=labels)
    plt.xlabel('Predicted Class')
    plt.ylabel('Original Class')
    plt.show()

    # representing B in heatmap format
    print("-"*50, "Precision matrix", "-"*50)
    plt.figure(figsize=(10,5))
    sns.heatmap(B, annot=True, cmap=cmap, fmt=".3f", xticklabels=labels, yticklabels=labels)
    plt.xlabel('Predicted Class')
    plt.ylabel('Original Class')
    plt.show()
    print("Sum of columns in precision matrix",B.sum(axis=0))

    # representing A in heatmap format
    print("-"*50, "Recall matrix" , "-"*50)
    plt.figure(figsize=(10,5))
    sns.heatmap(A, annot=True, cmap=cmap, fmt=".3f", xticklabels=labels, yticklabels=labels)
    plt.xlabel('Predicted Class')
    plt.ylabel('Original Class')
    plt.show()
    # A is the recall matrix, so each of its rows sums to 1
    print("Sum of rows in precision matrix",A.sum(axis=1))
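The row and column normalizations used in `plot_confusion_matrix` can be verified on the 2×2 example from its comments. This is a small sketch with NumPy that uses `keepdims=True` instead of the transpose trick; the variable names `recall` and `precision` are chosen for this example.

```python
import numpy as np

# Rows are the original class, columns are the predicted class.
C = np.array([[1, 2],
              [3, 4]])

# Recall matrix: divide each row by its row sum (axis=1).
recall = C / C.sum(axis=1, keepdims=True)

# Precision matrix: divide each column by its column sum (axis=0).
precision = C / C.sum(axis=0, keepdims=True)

print(recall)
print(precision)
```

Each row of `recall` and each column of `precision` sums to 1, which matches the sums that the function prints after plotting.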
# we need to generate 9 numbers and the sum of numbers should be 1
# one solution is to generate 9 numbers and divide each of the numbers by their sum
# ref: https://stackoverflow.com/a/18662466/4084039
test_data_len = X_test.shape[0]
cv_data_len = X_cv.shape[0]
# we create an output array that has exactly the same number of rows as the CV data
cv_predicted_y = np.zeros((cv_data_len,9))
for i in range(cv_data_len):
    rand_probs = np.random.rand(1,9)
    cv_predicted_y[i] = ((rand_probs/sum(sum(rand_probs)))[0])
print("Log loss on Cross Validation Data using Random Model",log_loss(y_cv,cv_predicted_y, eps=1e-15))
# Test-Set error.
# we create an output array that has exactly the same number of rows as the test data
test_predicted_y = np.zeros((test_data_len,9))
for i in range(test_data_len):
    rand_probs = np.random.rand(1,9)
    test_predicted_y[i] = ((rand_probs/sum(sum(rand_probs)))[0])
print("Log loss on Test Data using Random Model",log_loss(y_test,test_predicted_y, eps=1e-15))
predicted_y =np.argmax(test_predicted_y, axis=1)
plot_confusion_matrix(y_test, predicted_y+1)
Log loss on Cross Validation Data using Random Model 2.4906058096334487
Log loss on Test Data using Random Model 2.453446709743105
Number of misclassified points 89.37442502299908
-------------------------------------------------- Confusion matrix --------------------------------------------------
-------------------------------------------------- Precision matrix --------------------------------------------------
Sum of columns in precision matrix [1. 1. 1. 1. 1. 1. 1. 1. 1.]
-------------------------------------------------- Recall matrix --------------------------------------------------
Sum of rows in precision matrix [1. 1. 1. 1. 1. 1. 1. 1. 1.]
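For reference, a model that always predicts the uniform distribution over the nine classes has a log loss of ln(9), independent of the true labels. The sketch below just evaluates that constant; it is a sanity check added here, not a cell from the original notebook.

```python
import math

n_classes = 9
# Every true class receives probability 1/9, so the loss per point is -ln(1/9) = ln(9).
uniform_log_loss = -math.log(1.0 / n_classes)
print(uniform_log_loss)
```

The random model above scores roughly 2.45 to 2.49, slightly worse than this uniform baseline, most likely because random probability vectors sometimes assign a very small probability to the true class. Any useful classifier should land well below both values.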
# find more about KNeighborsClassifier() here http://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html
# -------------------------
# default parameter
# KNeighborsClassifier(n_neighbors=5, weights='uniform', algorithm='auto', leaf_size=30, p=2,
# metric='minkowski', metric_params=None, n_jobs=1, **kwargs)
# methods of
# fit(X, y) : Fit the model using X as training data and y as target values
# predict(X):Predict the class labels for the provided data
# predict_proba(X):Return probability estimates for the test data X.
#-------------------------------------
# video link: https://www.appliedaicourse.com/course/applied-ai-course-online/lessons/k-nearest-neighbors-geometric-intuition-with-a-toy-example-1/
#-------------------------------------
# find more about CalibratedClassifierCV here at http://scikit-learn.org/stable/modules/generated/sklearn.calibration.CalibratedClassifierCV.html
# ----------------------------
# default parameters
# sklearn.calibration.CalibratedClassifierCV(base_estimator=None, method='sigmoid', cv=3)
#
# some of the methods of CalibratedClassifierCV()
# fit(X, y[, sample_weight]) Fit the calibrated model
# get_params([deep]) Get parameters for this estimator.
# predict(X) Predict the target of new samples.
# predict_proba(X) Posterior probabilities of classification
#-------------------------------------
# video link:
#-------------------------------------
if not os.path.exists('models/uni_byte_knn.sav'):
    # candidate values of k (number of neighbours): 1, 3, ..., 13
    alpha = [x for x in range(1, 15, 2)]
    cv_log_error_array=[]
    for i in alpha:
        k_cfl=KNeighborsClassifier(n_neighbors=i)
        k_cfl.fit(X_train,y_train)
        sig_clf = CalibratedClassifierCV(k_cfl, method="sigmoid")
        sig_clf.fit(X_train, y_train)
        predict_y = sig_clf.predict_proba(X_cv)
        cv_log_error_array.append(log_loss(y_cv, predict_y, labels=k_cfl.classes_, eps=1e-15))

    for i in range(len(cv_log_error_array)):
        print ('log_loss for k = ',alpha[i],'is',cv_log_error_array[i])

    # index of the k with the lowest cross validation log loss
    best_alpha = np.argmin(cv_log_error_array)

    fig, ax = plt.subplots()
    ax.plot(alpha, cv_log_error_array,c='g')
    for i, txt in enumerate(np.round(cv_log_error_array,3)):
        ax.annotate((alpha[i],np.round(txt,3)), (alpha[i],cv_log_error_array[i]))
    plt.grid()
    plt.title("Cross Validation Error for each alpha")
    plt.xlabel("Alpha i's")
    plt.ylabel("Error measure")
    plt.show()

    # refit with the best k and calibrate the probabilities
    k_cfl=KNeighborsClassifier(n_neighbors=alpha[best_alpha])
    k_cfl.fit(X_train,y_train)
    sig_clf = CalibratedClassifierCV(k_cfl, method="sigmoid")
    sig_clf.fit(X_train, y_train)
    # save the model to disk (create the 'models' folder first if it is missing)
    os.makedirs('models', exist_ok=True)
    with open('models/uni_byte_knn.sav', 'wb') as f:
        pickle.dump(sig_clf, f)
else:
    # load the model from disk
    with open('models/uni_byte_knn.sav', 'rb') as f:
        sig_clf = pickle.load(f)
print(sig_clf)
predict_y = sig_clf.predict_proba(X_train)
print ('For values of best alpha = ', sig_clf.base_estimator.n_neighbors, "The train log loss is:",log_loss(y_train, predict_y))
predict_y = sig_clf.predict_proba(X_cv)
print('For values of best alpha = ', sig_clf.base_estimator.n_neighbors, "The cross validation log loss is:",log_loss(y_cv, predict_y))
predict_y = sig_clf.predict_proba(X_test)
print('For values of best alpha = ', sig_clf.base_estimator.n_neighbors, "The test log loss is:",log_loss(y_test, predict_y))
plot_confusion_matrix(y_test, sig_clf.predict(X_test))
CalibratedClassifierCV(base_estimator=KNeighborsClassifier(n_neighbors=3))
For values of best alpha = 3 The train log loss is: 0.1486818165735075
For values of best alpha = 3 The cross validation log loss is: 0.15419364640628
For values of best alpha = 3 The test log loss is: 0.15906862016382445
Number of misclassified points 3.863845446182153
-------------------------------------------------- Confusion matrix --------------------------------------------------
-------------------------------------------------- Precision matrix --------------------------------------------------
Sum of columns in precision matrix [1. 1. 1. 1. 1. 1. 1. 1. 1.]
-------------------------------------------------- Recall matrix --------------------------------------------------
Sum of rows in precision matrix [1. 1. 1. 1. 1. 1. 1. 1. 1.]
# note: the cell below fits LogisticRegression, not SGDClassifier; read more at http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html
# read more about SGDClassifier() at http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.SGDClassifier.html
# ------------------------------
# default parameters
# SGDClassifier(loss='hinge', penalty='l2', alpha=0.0001, l1_ratio=0.15, fit_intercept=True, max_iter=None, tol=None,
# shuffle=True, verbose=0, epsilon=0.1, n_jobs=1, random_state=None, learning_rate='optimal', eta0=0.0, power_t=0.5,
# class_weight=None, warm_start=False, average=False, n_iter=None)
# some of methods
# fit(X, y[, coef_init, intercept_init, …]) Fit linear model with Stochastic Gradient Descent.
# predict(X) Predict class labels for samples in X.
#-------------------------------
# video link: https://www.appliedaicourse.com/course/applied-ai-course-online/lessons/geometric-intuition-1/
#------------------------------
if not os.path.exists('models/uni_byte_lr.sav'):
    # candidate values of C (inverse regularization strength): 1e-5 ... 1e3
    alpha = [10 ** x for x in range(-5, 4)]
    cv_log_error_array=[]
    for i in alpha:
        logisticR=LogisticRegression(penalty='l2',C=i,class_weight='balanced')
        logisticR.fit(X_train,y_train)
        sig_clf = CalibratedClassifierCV(logisticR, method="sigmoid")
        sig_clf.fit(X_train, y_train)
        predict_y = sig_clf.predict_proba(X_cv)
        cv_log_error_array.append(log_loss(y_cv, predict_y, labels=logisticR.classes_, eps=1e-15))

    for i in range(len(cv_log_error_array)):
        print ('log_loss for c = ',alpha[i],'is',cv_log_error_array[i])

    # index of the C with the lowest cross validation log loss
    best_alpha = np.argmin(cv_log_error_array)

    fig, ax = plt.subplots()
    ax.plot(alpha, cv_log_error_array,c='g')
    for i, txt in enumerate(np.round(cv_log_error_array,3)):
        ax.annotate((alpha[i],np.round(txt,3)), (alpha[i],cv_log_error_array[i]))
    plt.grid()
    plt.title("Cross Validation Error for each alpha")
    plt.xlabel("Alpha i's")
    plt.ylabel("Error measure")
    plt.show()

    # refit with the best C and calibrate the probabilities
    logisticR=LogisticRegression(penalty='l2',C=alpha[best_alpha],class_weight='balanced')
    logisticR.fit(X_train,y_train)
    sig_clf = CalibratedClassifierCV(logisticR, method="sigmoid")
    sig_clf.fit(X_train, y_train)
    pred_y=sig_clf.predict(X_test)
    # save the model to disk (create the 'models' folder first if it is missing)
    os.makedirs('models', exist_ok=True)
    with open('models/uni_byte_lr.sav', 'wb') as f:
        pickle.dump(sig_clf, f)
else:
    # load the model from disk
    with open('models/uni_byte_lr.sav', 'rb') as f:
        sig_clf = pickle.load(f)
print(sig_clf)
predict_y = sig_clf.predict_proba(X_train)
print ('log loss for train data',log_loss(y_train, predict_y, labels=sig_clf.base_estimator.classes_, eps=1e-15))
predict_y = sig_clf.predict_proba(X_cv)
print ('log loss for cv data',log_loss(y_cv, predict_y, labels=sig_clf.base_estimator.classes_, eps=1e-15))
predict_y = sig_clf.predict_proba(X_test)
print ('log loss for test data',log_loss(y_test, predict_y, labels=sig_clf.base_estimator.classes_, eps=1e-15))
plot_confusion_matrix(y_test, sig_clf.predict(X_test))
CalibratedClassifierCV(base_estimator=LogisticRegression(C=100,
class_weight='balanced'))
log loss for train data 0.8552221246951939
log loss for cv data 0.8755372295636145
log loss for test data 0.8560550254875295
Number of misclassified points 26.770929162833486
-------------------------------------------------- Confusion matrix --------------------------------------------------
-------------------------------------------------- Precision matrix --------------------------------------------------
Sum of columns in precision matrix [ 1. 1. 1. 1. nan 1. 1. 1. 1.]
-------------------------------------------------- Recall matrix --------------------------------------------------
Sum of rows in precision matrix [1. 1. 1. 1. 1. 1. 1. 1. 1.]
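The `nan` in the column sums above most likely means that no test point was predicted as the fifth class, so that column of the confusion matrix sums to zero and the precision computation divides 0 by 0. A possible guard is sketched below under that assumption; `precision_matrix` is a hypothetical helper, the 3×3 matrix is toy data, and leaving an empty column as zeros is a design choice rather than the notebook's behaviour.

```python
import numpy as np

def precision_matrix(C):
    """Column-normalize a confusion matrix, leaving all-zero columns as zeros."""
    C = np.asarray(C, dtype=float)
    col_sums = C.sum(axis=0, keepdims=True)
    out = np.zeros_like(C)
    # divide only where the column sum is non-zero; other cells keep the 0 from `out`
    np.divide(C, col_sums, out=out, where=col_sums != 0)
    return out

# Toy confusion matrix in which the second class is never predicted.
C = [[2, 0, 1],
     [1, 0, 3],
     [0, 0, 4]]
P = precision_matrix(C)
print(P.sum(axis=0))
```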
# --------------------------------
# default parameters
# sklearn.ensemble.RandomForestClassifier(n_estimators=10, criterion='gini', max_depth=None, min_samples_split=2,
# min_samples_leaf=1, min_weight_fraction_leaf=0.0, max_features='auto', max_leaf_nodes=None, min_impurity_decrease=0.0,
# min_impurity_split=None, bootstrap=True, oob_score=False, n_jobs=1, random_state=None, verbose=0, warm_start=False,
# class_weight=None)
# Some of methods of RandomForestClassifier()
# fit(X, y, [sample_weight]) Build a forest of trees from the training set (X, y).
# predict(X) Predict the class for samples in X.
# predict_proba (X) Predict class probabilities for samples in X.
# some of attributes of RandomForestClassifier()
# feature_importances_ : array of shape = [n_features]
# The feature importances (the higher, the more important the feature).
# --------------------------------
# video link: https://www.appliedaicourse.com/course/applied-ai-course-online/lessons/random-forest-and-their-construction-2/
# --------------------------------
from sklearn.ensemble import RandomForestClassifier

if not os.path.exists('models/uni_byte_rf.sav'):
    # alpha holds the candidate values of n_estimators (number of trees)
    alpha=[10,50,100,500,1000,2000,3000]
    cv_log_error_array=[]
    for i in alpha:
        r_cfl=RandomForestClassifier(n_estimators=i,random_state=42,n_jobs=-1)
        r_cfl.fit(X_train,y_train)
        sig_clf = CalibratedClassifierCV(r_cfl, method="sigmoid")
        sig_clf.fit(X_train, y_train)
        predict_y = sig_clf.predict_proba(X_cv)
        cv_log_error_array.append(log_loss(y_cv, predict_y, labels=r_cfl.classes_, eps=1e-15))
    for i in range(len(cv_log_error_array)):
        print ('log_loss for n_estimators = ',alpha[i],'is',cv_log_error_array[i])
    # index of the n_estimators value with the lowest cross-validation log loss
    best_alpha = np.argmin(cv_log_error_array)
    fig, ax = plt.subplots()
    ax.plot(alpha, cv_log_error_array,c='g')
    for i, txt in enumerate(np.round(cv_log_error_array,3)):
        ax.annotate((alpha[i],np.round(txt,3)), (alpha[i],cv_log_error_array[i]))
    plt.grid()
    plt.title("Cross Validation Error for each alpha")
    plt.xlabel("Alpha i's")
    plt.ylabel("Error measure")
    plt.show()
    # refit with the best n_estimators and calibrate the probabilities
    r_cfl=RandomForestClassifier(n_estimators=alpha[best_alpha],random_state=42,n_jobs=-1)
    r_cfl.fit(X_train,y_train)
    sig_clf = CalibratedClassifierCV(r_cfl, method="sigmoid")
    sig_clf.fit(X_train, y_train)
    # save the model to disk
    pickle.dump(sig_clf, open('models/uni_byte_rf.sav', 'wb'))
else:
    # load the model from disk
    sig_clf = pickle.load(open('models/uni_byte_rf.sav', 'rb'))
print(sig_clf)
predict_y = sig_clf.predict_proba(X_train)
print('For values of best alpha = ', sig_clf.base_estimator.n_estimators, "The train log loss is:",log_loss(y_train, predict_y))
predict_y = sig_clf.predict_proba(X_cv)
print('For values of best alpha = ', sig_clf.base_estimator.n_estimators, "The cross validation log loss is:",log_loss(y_cv, predict_y))
predict_y = sig_clf.predict_proba(X_test)
print('For values of best alpha = ', sig_clf.base_estimator.n_estimators, "The test log loss is:",log_loss(y_test, predict_y))
plot_confusion_matrix(y_test, sig_clf.predict(X_test))
CalibratedClassifierCV(base_estimator=RandomForestClassifier(n_estimators=500,
n_jobs=-1,
random_state=42))
For values of best alpha = 500 The train log loss is: 0.04501130259576106
For values of best alpha = 500 The cross validation log loss is: 0.04581601531690915
For values of best alpha = 500 The test log loss is: 0.04320847438163486
Number of misclassified points 0.5059797608095675
-------------------------------------------------- Confusion matrix --------------------------------------------------
-------------------------------------------------- Precision matrix --------------------------------------------------
Sum of columns in precision matrix [1. 1. 1. 1. 1. 1. 1. 1. 1.]
-------------------------------------------------- Recall matrix --------------------------------------------------
Sum of rows in precision matrix [1. 1. 1. 1. 1. 1. 1. 1. 1.]
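To put the log-loss values reported above in context, it helps to know what an uninformed model would score. The following is a minimal sketch of multi-class log loss in plain Python, not scikit-learn's implementation; the function name `multiclass_log_loss` and the toy labels are made up for illustration. With 9 classes, a model that always outputs a uniform probability of 1/9 scores ln(9).

```python
import math

def multiclass_log_loss(y_true, probs, eps=1e-15):
    """Mean negative log-probability assigned to the true class.

    y_true: list of class indices; probs: list of per-class probability lists.
    """
    total = 0.0
    for label, row in zip(y_true, probs):
        p = min(max(row[label], eps), 1 - eps)  # clip to avoid log(0)
        total += -math.log(p)
    return total / len(y_true)

n_classes = 9
# A "random" model: the same uniform distribution for every sample.
uniform = [[1.0 / n_classes] * n_classes for _ in range(4)]
baseline = multiclass_log_loss([0, 3, 5, 8], uniform)
print(round(baseline, 3))  # ln(9), about 2.197
```

Any useful classifier on this 9-class problem should land well below that baseline.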
# Training a hyper-parameter tuned XGBoost classifier on our train data
# find more about XGBClassifier function here http://xgboost.readthedocs.io/en/latest/python/python_api.html?#xgboost.XGBClassifier
# -------------------------
# default parameters
# class xgboost.XGBClassifier(max_depth=3, learning_rate=0.1, n_estimators=100, silent=True,
# objective='binary:logistic', booster='gbtree', n_jobs=1, nthread=None, gamma=0, min_child_weight=1,
# max_delta_step=0, subsample=1, colsample_bytree=1, colsample_bylevel=1, reg_alpha=0, reg_lambda=1,
# scale_pos_weight=1, base_score=0.5, random_state=0, seed=None, missing=None, **kwargs)
# some of the methods of XGBClassifier()
# fit(X, y, sample_weight=None, eval_set=None, eval_metric=None, early_stopping_rounds=None, verbose=True, xgb_model=None)
# get_params([deep]) Get parameters for this estimator.
# predict(data, output_margin=False, ntree_limit=0) : Predict with data. NOTE: This function is not thread safe.
# get_score(importance_type='weight') -> get the feature importance
# -----------------------
# video link1: https://www.appliedaicourse.com/course/applied-ai-course-online/lessons/regression-using-decision-trees-2/
# video link2: https://www.appliedaicourse.com/course/applied-ai-course-online/lessons/what-are-ensembles/
# -----------------------
if not os.path.exists('models/uni_byte_xgb.sav'):
    # alpha holds the candidate values of n_estimators (number of boosting rounds)
    alpha=[10,50,100,500,1000,2000]
    cv_log_error_array=[]
    for i in alpha:
        # eval_metric is an XGBClassifier parameter, not a CalibratedClassifierCV one
        x_cfl=XGBClassifier(n_estimators=i,nthread=-1, eval_metric='mlogloss')
        x_cfl.fit(X_train,y_train)
        sig_clf = CalibratedClassifierCV(x_cfl, method="sigmoid")
        sig_clf.fit(X_train, y_train)
        predict_y = sig_clf.predict_proba(X_cv)
        cv_log_error_array.append(log_loss(y_cv, predict_y, labels=x_cfl.classes_, eps=1e-15))
    for i in range(len(cv_log_error_array)):
        print ('log_loss for n_estimators = ',alpha[i],'is',cv_log_error_array[i])
    # index of the n_estimators value with the lowest cross-validation log loss
    best_alpha = np.argmin(cv_log_error_array)
    fig, ax = plt.subplots()
    ax.plot(alpha, cv_log_error_array,c='g')
    for i, txt in enumerate(np.round(cv_log_error_array,3)):
        ax.annotate((alpha[i],np.round(txt,3)), (alpha[i],cv_log_error_array[i]))
    plt.grid()
    plt.title("Cross Validation Error for each alpha")
    plt.xlabel("Alpha i's")
    plt.ylabel("Error measure")
    plt.show()
    # refit with the best n_estimators and calibrate the probabilities
    x_cfl=XGBClassifier(n_estimators=alpha[best_alpha], nthread=-1, eval_metric='mlogloss')
    x_cfl.fit(X_train,y_train)
    sig_clf = CalibratedClassifierCV(x_cfl, method="sigmoid")
    sig_clf.fit(X_train, y_train)
    # save the model to disk
    pickle.dump(sig_clf, open('models/uni_byte_xgb.sav', 'wb'))
else:
    # load the model from disk
    sig_clf = pickle.load(open('models/uni_byte_xgb.sav', 'rb'))
print(sig_clf)
predict_y = sig_clf.predict_proba(X_train)
print('For values of best alpha = ', sig_clf.base_estimator.n_estimators, "The train log loss is:",log_loss(y_train, predict_y))
predict_y = sig_clf.predict_proba(X_cv)
print('For values of best alpha = ', sig_clf.base_estimator.n_estimators, "The cross validation log loss is:",log_loss(y_cv, predict_y))
predict_y = sig_clf.predict_proba(X_test)
print('For values of best alpha = ', sig_clf.base_estimator.n_estimators, "The test log loss is:",log_loss(y_test, predict_y))
plot_confusion_matrix(y_test, sig_clf.predict(X_test))
CalibratedClassifierCV(base_estimator=XGBClassifier(base_score=0.5,
booster='gbtree',
colsample_bylevel=1,
colsample_bynode=1,
colsample_bytree=1,
enable_categorical=False,
gamma=0, gpu_id=-1,
importance_type=None,
interaction_constraints='',
learning_rate=0.300000012,
max_delta_step=0,
max_depth=6,
min_child_weight=1,
missing=nan,
monotone_constraints='()',
n_estimators=100, n_jobs=16,
nthread=-1,
num_parallel_tree=1,
objective='multi:softprob',
predictor='auto',
random_state=0, reg_alpha=0,
reg_lambda=1,
scale_pos_weight=None,
subsample=1,
tree_method='exact',
validate_parameters=1,
verbosity=None))
For values of best alpha = 100 The train log loss is: 0.04291138966331521
For values of best alpha = 100 The cross validation log loss is: 0.04722123070782947
For values of best alpha = 100 The test log loss is: 0.03249757011891755
Number of misclassified points 0.3219871205151794
-------------------------------------------------- Confusion matrix --------------------------------------------------
-------------------------------------------------- Precision matrix --------------------------------------------------
Sum of columns in precision matrix [1. 1. 1. 1. 1. 1. 1. 1. 1.]
-------------------------------------------------- Recall matrix --------------------------------------------------
Sum of rows in precision matrix [1. 1. 1. 1. 1. 1. 1. 1. 1.]
if False:
    # https://www.analyticsvidhya.com/blog/2016/03/complete-guide-parameter-tuning-xgboost-with-codes-python/
    x_cfl=XGBClassifier()
    # search space for the randomized hyper-parameter search
    params={
        'learning_rate':[0.01,0.03,0.05,0.1,0.15,0.2],
        'n_estimators':[100,200,500,1000,2000],
        'max_depth':[3,5,10],
        'colsample_bytree':[0.1,0.3,0.5,1],
        'subsample':[0.1,0.3,0.5,1]
    }
    random_cfl1=RandomizedSearchCV(x_cfl,param_distributions=params,verbose=10,n_jobs=-1)
    random_cfl1.fit(X_train,y_train)
if False:
    print (random_cfl1.best_params_)
# Training a hyper-parameter tuned XGBoost classifier on our train data
# find more about XGBClassifier function here http://xgboost.readthedocs.io/en/latest/python/python_api.html?#xgboost.XGBClassifier
# -------------------------
# default parameters
# class xgboost.XGBClassifier(max_depth=3, learning_rate=0.1, n_estimators=100, silent=True,
# objective='binary:logistic', booster='gbtree', n_jobs=1, nthread=None, gamma=0, min_child_weight=1,
# max_delta_step=0, subsample=1, colsample_bytree=1, colsample_bylevel=1, reg_alpha=0, reg_lambda=1,
# scale_pos_weight=1, base_score=0.5, random_state=0, seed=None, missing=None, **kwargs)
# some of the methods of XGBClassifier()
# fit(X, y, sample_weight=None, eval_set=None, eval_metric=None, early_stopping_rounds=None, verbose=True, xgb_model=None)
# get_params([deep]) Get parameters for this estimator.
# predict(data, output_margin=False, ntree_limit=0) : Predict with data. NOTE: This function is not thread safe.
# get_score(importance_type='weight') -> get the feature importance
# -----------------------
# video link2: https://www.appliedaicourse.com/course/applied-ai-course-online/lessons/what-are-ensembles/
# -----------------------
x_cfl=XGBClassifier(n_estimators=2000, learning_rate=0.15, colsample_bytree=0.5, max_depth=5)
x_cfl.fit(X_train,y_train)
c_cfl=CalibratedClassifierCV(x_cfl,method='sigmoid')
c_cfl.fit(X_train,y_train)
predict_y = c_cfl.predict_proba(X_train)
print ('train loss',log_loss(y_train, predict_y))
predict_y = c_cfl.predict_proba(X_cv)
print ('cv loss',log_loss(y_cv, predict_y))
predict_y = c_cfl.predict_proba(X_test)
print ('test loss',log_loss(y_test, predict_y))
[02:24:54] WARNING: ..\src\learner.cc:1115: Starting in XGBoost 1.3.0, the default evaluation metric used with the objective 'multi:softprob' was changed from 'merror' to 'mlogloss'. Explicitly set eval_metric if you'd like to restore the old behavior.
train loss 0.021426644404302808
cv loss 0.09046576633194085
test loss 0.07327204190976644
There are 10868 .asm files, which together take up about 150 GB. The .asm files contain:
1. Addresses
2. Segments
3. Opcodes
4. Registers
5. Function calls
6. APIs
We extracted all the features with parallel processing, which lets us use every core available on the machine. In total we extracted 52 features from the .asm files. The features were handpicked after reading the top solutions in papers, videos and blogs.
Reference: https://www.kaggle.com/c/malware-classification/discussion
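As a small illustration of how such count features come out of a single file, the sketch below tokenizes three lines from the .asm sample shown earlier and counts opcodes and registers. The feature lists are shortened for the example, and the matching rules (exact token match for opcodes, substring match for registers, registers counted only in text or CODE segments) mirror the extraction code in this notebook rather than any official tool.

```python
# Shortened, illustrative feature lists; the notebook uses longer ones.
opcodes = ['push', 'mov', 'call', 'lea']
registers = ['esi', 'eax', 'ecx', 'esp']

# Three lines taken from the .asm sample shown earlier.
sample = [
    ".text:00401000 56 push esi",
    ".text:00401001 8D 44 24 08 lea eax, [esp+8]",
    ".text:00401006 8B F1 mov esi, ecx",
]

opcode_counts = dict.fromkeys(opcodes, 0)
register_counts = dict.fromkeys(registers, 0)

for raw in sample:
    tokens = raw.rstrip().split()
    if not tokens:          # skip blank lines
        continue
    segment, rest = tokens[0], tokens[1:]
    for op in opcodes:
        if op in rest:      # exact token match, counted once per line
            opcode_counts[op] += 1
    if 'text' in segment or 'CODE' in segment:
        for reg in registers:
            # substring match per token, so "[esp+8]" counts for esp
            register_counts[reg] += sum(reg in tok for tok in rest)

print(opcode_counts)
print(register_counts)
```

Because registers are matched as substrings, operands such as `eax,` or `[esp+8]` are counted without any extra parsing of the operand syntax.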
if False:
    # initially create five folders (first, second, third, fourth, fifth) and an output folder,
    # then split the files randomly and evenly across the five folders
    folders = ['first','second','third','fourth','fifth']
    for i in folders + ['output']:
        if not os.path.isdir(i):
            os.makedirs(i)
    source='train/'
    files = os.listdir('train')
    #ID=df['Id'].tolist()
    # shuffle needs a mutable list; a bare range cannot be shuffled in Python 3
    # (r is assumed to be the random module, imported elsewhere as r)
    data=list(range(0,10868))
    r.shuffle(data)
    for i in range(0,10868):
        # the i-th file in shuffled order goes to folder i % 5
        shutil.move(source+files[data[i]],folders[i % 5])
#http://flint.cs.yale.edu/cs421/papers/x86-asm/asm.html
def extract_asm_features(folder, output_name):
    # Counts prefixes, opcodes, registers and keywords for every file in `folder`
    # and writes one comma-separated row per file to output/<output_name>.
    #The prefixes tell us about the segments that are present in the asm files
    #There are 450 segments(approx) present in all asm files.
    #these prefixes are the segments that gave the best results.
    #https://en.wikipedia.org/wiki/Data_segment
    prefixes = ['HEADER:','.text:','.Pav:','.idata:','.data:','.bss:','.rdata:','.edata:','.rsrc:','.tls:','.reloc:','.BSS:','.CODE']
    #these are the opcodes that gave the best results
    #https://en.wikipedia.org/wiki/X86_instruction_listings
    opcodes = ['jmp', 'mov', 'retf', 'push', 'pop', 'xor', 'retn', 'nop', 'sub', 'inc', 'dec', 'add','imul', 'xchg', 'or', 'shr', 'cmp', 'call', 'shl', 'ror', 'rol', 'jnb','jz','rtn','lea','movzx']
    #keywords taken from different blogs
    keywords = ['.dll','std::',':dword']
    #general purpose registers and special registers
    registers=['edx','esi','eax','ebx','ecx','edi','ebp','esp','eip']
    # os.path.join avoids backslash escapes such as "\a" or "\t" inside the output path
    file1=open(os.path.join("output",output_name),"w+")
    files = os.listdir(folder)
    for f in files:
        #filling the arrays with zeros
        prefixescount=np.zeros(len(prefixes),dtype=int)
        opcodescount=np.zeros(len(opcodes),dtype=int)
        keywordcount=np.zeros(len(keywords),dtype=int)
        registerscount=np.zeros(len(registers),dtype=int)
        f2=f.split('.')[0]
        file1.write(f2+",")
        # https://docs.python.org/3/library/codecs.html#codecs.ignore_errors
        # https://docs.python.org/3/library/codecs.html#codecs.Codec.encode
        with codecs.open(os.path.join(folder,f),encoding='cp1252',errors ='replace') as fli:
            for lines in fli:
                # https://www.tutorialspoint.com/python3/string_rstrip.htm
                line=lines.rstrip().split()
                # skip blank lines, which have no segment prefix
                if not line:
                    continue
                l=line[0]
                #counting the prefixes in each line
                for i in range(len(prefixes)):
                    if prefixes[i] in line[0]:
                        prefixescount[i]+=1
                line=line[1:]
                #counting the opcodes in each line (at most once per line per opcode)
                for i in range(len(opcodes)):
                    if any(opcodes[i]==li for li in line):
                        opcodescount[i]+=1
                #counting registers in the line
                for i in range(len(registers)):
                    for li in line:
                        # we will use registers only in 'text' and 'CODE' segments
                        if registers[i] in li and ('text' in l or 'CODE' in l):
                            registerscount[i]+=1
                #counting keywords in the line
                for i in range(len(keywords)):
                    for li in line:
                        if keywords[i] in li:
                            keywordcount[i]+=1
        #pushing the values into the file after reading whole file
        for prefix in prefixescount:
            file1.write(str(prefix)+",")
        for opcode in opcodescount:
            file1.write(str(opcode)+",")
        for register in registerscount:
            file1.write(str(register)+",")
        for key in keywordcount:
            file1.write(str(key)+",")
        file1.write("\n")
    file1.close()

# one worker function per folder; all five run the same extraction on a different folder
def firstprocess():
    extract_asm_features('first','asmsmallfile.txt')

def secondprocess():
    extract_asm_features('second','mediumasmfile.txt')

def thirdprocess():
    extract_asm_features('third','largeasmfile.txt')

def fourthprocess():
    extract_asm_features('fourth','hugeasmfile.txt')

def fifthprocess():
    extract_asm_features('fifth','trainasmfile.txt')
def main():
    #the below code is used for multiprocessing
    #the number of processes depends upon the number of cores present in the system
    #each Process runs one of the extraction functions in its own process
    p1=Process(target=firstprocess)
    p2=Process(target=secondprocess)
    p3=Process(target=thirdprocess)
    p4=Process(target=fourthprocess)
    p5=Process(target=fifthprocess)
    #start() launches each process
    p1.start()
    p2.start()
    p3.start()
    p4.start()
    p5.start()
    #join() waits until every process has finished
    p1.join()
    p2.join()
    p3.join()
    p4.join()
    p5.join()
if __name__=="__main__":
    if False:
        main()
# asmoutputfile.csv (output generated from the above two cells) contains all the features extracted from the .asm files
# this file will be uploaded to the drive, so you can use it directly
dfasm=pd.read_csv("asmoutputfile.csv")
Y.columns = ['ID', 'Class']
result_asm = pd.merge(dfasm, Y,on='ID', how='left')
result_asm.head()
| | ID | HEADER: | .text: | .Pav: | .idata: | .data: | .bss: | .rdata: | .edata: | .rsrc: | ... | edx | esi | eax | ebx | ecx | edi | ebp | esp | eip | Class |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 01kcPWA9K2BOxQeS5Rju | 19 | 744 | 0 | 127 | 57 | 0 | 323 | 0 | 3 | ... | 18 | 66 | 15 | 43 | 83 | 0 | 17 | 48 | 29 | 1 |
| 1 | 1E93CpP60RHFNiT5Qfvn | 17 | 838 | 0 | 103 | 49 | 0 | 0 | 0 | 3 | ... | 18 | 29 | 48 | 82 | 12 | 0 | 14 | 0 | 20 | 1 |
| 2 | 3ekVow2ajZHbTnBcsDfX | 17 | 427 | 0 | 50 | 43 | 0 | 145 | 0 | 3 | ... | 13 | 42 | 10 | 67 | 14 | 0 | 11 | 0 | 9 | 1 |
| 3 | 3X2nY7iQaPBIWDrAZqJe | 17 | 227 | 0 | 43 | 19 | 0 | 0 | 0 | 3 | ... | 6 | 8 | 14 | 7 | 2 | 0 | 8 | 0 | 6 | 1 |
| 4 | 46OZzdsSKDCFV8h7XWxf | 17 | 402 | 0 | 59 | 170 | 0 | 0 | 0 | 3 | ... | 12 | 9 | 18 | 29 | 5 | 0 | 11 | 0 | 11 | 1 |
5 rows × 53 columns
if not os.path.exists('asm_size_byte.csv'):
    #file sizes of asm files
    files=os.listdir('asmFiles')
    filenames=Y['ID'].tolist()
    class_y=Y['Class'].tolist()
    class_bytes=[]
    sizebytes=[]
    fnames=[]
    for file in files:
        # print(os.stat('byteFiles/0A32eTdBKayjCWhZqDOQ.txt'))
        # os.stat_result(st_mode=33206, st_ino=1125899906874507, st_dev=3561571700, st_nlink=1, st_uid=0, st_gid=0,
        # st_size=3680109, st_atime=1519638522, st_mtime=1519638522, st_ctime=1519638522)
        # read more about os.stat: here https://www.tutorialspoint.com/python/os_stat.htm
        statinfo=os.stat('asmFiles/'+file)
        # split the file name at '.' and take the first part of it, i.e. the file ID
        file=file.split('.')[0]
        if file in filenames:
            i=filenames.index(file)
            class_bytes.append(class_y[i])
            # converting bytes into MB
            sizebytes.append(statinfo.st_size/(1024.0*1024.0))
            fnames.append(file)
    asm_size_byte=pd.DataFrame({'ID':fnames,'size':sizebytes,'Class':class_bytes})
    asm_size_byte.to_csv('asm_size_byte.csv', index = False)
else:
    asm_size_byte = pd.read_csv('asm_size_byte.csv')
print(asm_size_byte.head())
                     ID       size  Class
0  01azqd4InC7m9JpocGv5  56.229886      9
1  01IsoiSMh5gxyDYTl4CB  13.999378      2
2  01jsnpXSAlgw6aPeDxrU   8.507785      9
3  01kcPWA9K2BOxQeS5Rju   0.078190      1
4  01SuzwMJEIXsK7A8dQbl   0.996723      8
#boxplot of asm files
ax = sns.boxplot(x="Class", y="size", data=asm_size_byte)
plt.title("boxplot of .asm file sizes")
plt.show()
# add the file size feature to previous extracted features
print(result_asm.shape)
print(asm_size_byte.shape)
result_asm = pd.merge(result_asm, asm_size_byte.drop(['Class'], axis=1),on='ID', how='left')
result_asm.head()
(10868, 53)
(10868, 3)
| | ID | HEADER: | .text: | .Pav: | .idata: | .data: | .bss: | .rdata: | .edata: | .rsrc: | ... | esi | eax | ebx | ecx | edi | ebp | esp | eip | Class | size |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 01kcPWA9K2BOxQeS5Rju | 19 | 744 | 0 | 127 | 57 | 0 | 323 | 0 | 3 | ... | 66 | 15 | 43 | 83 | 0 | 17 | 48 | 29 | 1 | 0.078190 |
| 1 | 1E93CpP60RHFNiT5Qfvn | 17 | 838 | 0 | 103 | 49 | 0 | 0 | 0 | 3 | ... | 29 | 48 | 82 | 12 | 0 | 14 | 0 | 20 | 1 | 0.063400 |
| 2 | 3ekVow2ajZHbTnBcsDfX | 17 | 427 | 0 | 50 | 43 | 0 | 145 | 0 | 3 | ... | 42 | 10 | 67 | 14 | 0 | 11 | 0 | 9 | 1 | 0.041695 |
| 3 | 3X2nY7iQaPBIWDrAZqJe | 17 | 227 | 0 | 43 | 19 | 0 | 0 | 0 | 3 | ... | 8 | 14 | 7 | 2 | 0 | 8 | 0 | 6 | 1 | 0.018757 |
| 4 | 46OZzdsSKDCFV8h7XWxf | 17 | 402 | 0 | 59 | 170 | 0 | 0 | 0 | 3 | ... | 9 | 18 | 29 | 5 | 0 | 11 | 0 | 11 | 1 | 0.037567 |
5 rows × 54 columns
# we normalize each column of the data
result_asm = normalize(result_asm)
result_asm.head()
| | ID | HEADER: | .text: | .Pav: | .idata: | .data: | .bss: | .rdata: | .edata: | .rsrc: | ... | esi | eax | ebx | ecx | edi | ebp | esp | eip | Class | size |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 01kcPWA9K2BOxQeS5Rju | 0.107345 | 0.001092 | 0.0 | 0.000761 | 0.000023 | 0.0 | 0.000084 | 0.0 | 0.000072 | ... | 0.000746 | 0.000301 | 0.000360 | 0.001057 | 0.0 | 0.030797 | 0.001468 | 0.003173 | 1 | 0.000432 |
| 1 | 1E93CpP60RHFNiT5Qfvn | 0.096045 | 0.001230 | 0.0 | 0.000617 | 0.000019 | 0.0 | 0.000000 | 0.0 | 0.000072 | ... | 0.000328 | 0.000965 | 0.000686 | 0.000153 | 0.0 | 0.025362 | 0.000000 | 0.002188 | 1 | 0.000327 |
| 2 | 3ekVow2ajZHbTnBcsDfX | 0.096045 | 0.000627 | 0.0 | 0.000300 | 0.000017 | 0.0 | 0.000038 | 0.0 | 0.000072 | ... | 0.000475 | 0.000201 | 0.000560 | 0.000178 | 0.0 | 0.019928 | 0.000000 | 0.000985 | 1 | 0.000172 |
| 3 | 3X2nY7iQaPBIWDrAZqJe | 0.096045 | 0.000333 | 0.0 | 0.000258 | 0.000008 | 0.0 | 0.000000 | 0.0 | 0.000072 | ... | 0.000090 | 0.000281 | 0.000059 | 0.000025 | 0.0 | 0.014493 | 0.000000 | 0.000657 | 1 | 0.000009 |
| 4 | 46OZzdsSKDCFV8h7XWxf | 0.096045 | 0.000590 | 0.0 | 0.000353 | 0.000068 | 0.0 | 0.000000 | 0.0 | 0.000072 | ... | 0.000102 | 0.000362 | 0.000243 | 0.000064 | 0.0 | 0.019928 | 0.000000 | 0.001204 | 1 | 0.000143 |
5 rows × 54 columns
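The `normalize` helper is defined elsewhere in the notebook and is not shown in this section. The values in the table above look consistent with column-wise min-max scaling that leaves `ID` and `Class` untouched, so the sketch below assumes that behaviour. The function name `min_max_normalize` and the toy frame are hypothetical.

```python
import pandas as pd

def min_max_normalize(df, skip=("ID", "Class")):
    """Rescale every column except those in `skip` to the range [0, 1]."""
    out = df.copy()
    for col in out.columns:
        if col in skip:
            continue
        lo, hi = out[col].min(), out[col].max()
        # a constant column has hi == lo; map it to 0 rather than divide by zero
        out[col] = (out[col] - lo) / (hi - lo) if hi > lo else 0.0
    return out

# Toy frame with one varying feature and one all-zero feature.
toy = pd.DataFrame({
    "ID": ["a", "b", "c"],
    "mov": [10, 20, 30],
    ".Pav:": [0, 0, 0],
    "Class": [1, 2, 9],
})
scaled = min_max_normalize(toy)
print(scaled)
```

Scaling each column independently keeps large raw counts such as `.text:` from dominating distance-based models like k-NN.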
ax = sns.boxplot(x="Class", y=".text:", data=result_asm)
plt.title("boxplot of .asm text segment")
plt.show()
The plot shows the .text segment count against the class label. Classes 1, 2 and 9 can be easily separated.
ax = sns.boxplot(x="Class", y=".Pav:", data=result_asm)
plt.title("boxplot of .asm pav segment")
plt.show()
ax = sns.boxplot(x="Class", y=".data:", data=result_asm)
plt.title("boxplot of .asm data segment")
plt.show()
The plot shows the .data segment count against the class label. Class 6 and class 9 can be easily separated from the other points.
ax = sns.boxplot(x="Class", y=".bss:", data=result_asm)
plt.title("boxplot of .asm bss segment")
plt.show()
The plot shows the .bss segment count against the class label. Very few files have a .bss segment.
ax = sns.boxplot(x="Class", y=".rdata:", data=result_asm)
plt.title("boxplot of .asm rdata segment")
plt.show()
The plot shows the .rdata segment count against the class label. Class 2 can be easily separated. Files at the 75th percentile have about 1M .rdata lines.
ax = sns.boxplot(x="Class", y="jmp", data=result_asm)
plt.title("boxplot of .asm jmp opcode")
plt.show()
The plot shows the jmp opcode count against the class label. In class 1, files at the 75th percentile have a jmp frequency of approximately 2000.
ax = sns.boxplot(x="Class", y="mov", data=result_asm)
plt.title("boxplot of .asm mov opcode")
plt.show()
The plot shows the mov opcode count against the class label. In class 1, files at the 75th percentile have a mov frequency of approximately 2000.
ax = sns.boxplot(x="Class", y="retf", data=result_asm)
plt.title("boxplot of .asm retf opcode")
plt.show()
The plot shows the retf opcode count against the class label. Class 6 can be easily separated using the retf opcode, whose frequency is approximately 250.
ax = sns.boxplot(x="Class", y="push", data=result_asm)
plt.title("boxplot of .asm push opcode")
plt.show()
The plot shows the push opcode count against the class label. In class 1, files at the 75th percentile have a push frequency of about 1000.
# check out the course content for more explanation of the t-SNE algorithm
# https://www.appliedaicourse.com/course/applied-ai-course-online/lessons/t-distributed-stochastic-neighbourhood-embeddingt-sne-part-1/
# multivariate analysis on asm files
# this is with perplexity 50
xtsne=TSNE(perplexity=50)
results=xtsne.fit_transform(result_asm.drop(['ID','Class'], axis=1).fillna(0))
vis_x = results[:, 0]
vis_y = results[:, 1 ]
# colour each point by the class label of the same row in result_asm
plt.scatter(vis_x, vis_y, c=result_asm['Class'], cmap=plt.cm.get_cmap("jet", 9))
plt.colorbar(ticks=range(10))
plt.clim(0.5, 9)
plt.show()
# univariate analysis of the .asm features shows that 'rtn', '.BSS:' and '.CODE' carry negligible information,
# so here we try multivariate analysis after removing those features
# the plot looks very messy
xtsne=TSNE(perplexity=30)
results=xtsne.fit_transform(result_asm.drop(['ID','Class', 'rtn', '.BSS:', '.CODE','size'], axis=1))
vis_x = results[:, 0]
vis_y = results[:, 1]
# colour each point by the class label of the same row in result_asm
plt.scatter(vis_x, vis_y, c=result_asm['Class'], cmap=plt.cm.get_cmap("jet", 9))
plt.colorbar(ticks=range(10))
plt.clim(0.5, 9)
plt.show()
TSNE for asm data with perplexity 50
asm_y = result_asm['Class']
asm_x = result_asm.drop(['ID','Class','.BSS:','rtn','.CODE'], axis=1)
X_train_asm, X_test_asm, y_train_asm, y_test_asm = train_test_split(asm_x,asm_y ,stratify=asm_y,test_size=0.20)
X_train_asm, X_cv_asm, y_train_asm, y_cv_asm = train_test_split(X_train_asm, y_train_asm,stratify=y_train_asm,test_size=0.20)
print( X_cv_asm.isnull().all())
HEADER:    False
.text:     False
.Pav:      False
.idata:    False
.data:     False
.bss:      False
.rdata:    False
.edata:    False
.rsrc:     False
.tls:      False
.reloc:    False
jmp        False
mov        False
retf       False
push       False
pop        False
xor        False
retn       False
nop        False
sub        False
inc        False
dec        False
add        False
imul       False
xchg       False
or         False
shr        False
cmp        False
call       False
shl        False
ror        False
rol        False
jnb        False
jz         False
lea        False
movzx      False
.dll       False
std::      False
:dword     False
edx        False
esi        False
eax        False
ebx        False
ecx        False
edi        False
ebp        False
esp        False
eip        False
size       False
dtype: bool
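The two chained `train_test_split` calls above carve the data into train, cross-validation and test sets. As a quick sanity check of the resulting proportions, the sketch below works out the split sizes for the 10868 .asm files. It assumes the test share is rounded up to a whole sample, which is how we understand scikit-learn to handle a fractional `test_size`; the helper `split_sizes` is hypothetical.

```python
import math

def split_sizes(n_samples, test_size):
    """Return (n_train, n_test), assuming the test share is rounded up."""
    n_test = math.ceil(n_samples * test_size)
    return n_samples - n_test, n_test

n_total = 10868                                # number of .asm files
n_rest, n_test = split_sizes(n_total, 0.20)    # first split: hold out the test set
n_train, n_cv = split_sizes(n_rest, 0.20)      # second split: hold out the cv set

print(n_train, n_cv, n_test)
print(round(n_train / n_total, 2), round(n_cv / n_total, 2), round(n_test / n_total, 2))
```

The overall proportions work out to roughly 64% train, 16% cross-validation and 20% test.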
# find more about KNeighborsClassifier() here http://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html
# -------------------------
# default parameter
# KNeighborsClassifier(n_neighbors=5, weights='uniform', algorithm='auto', leaf_size=30, p=2,
# metric='minkowski', metric_params=None, n_jobs=1, **kwargs)
# methods of KNeighborsClassifier()
# fit(X, y) : Fit the model using X as training data and y as target values
# predict(X):Predict the class labels for the provided data
# predict_proba(X):Return probability estimates for the test data X.
#-------------------------------------
# video link: https://www.appliedaicourse.com/course/applied-ai-course-online/lessons/k-nearest-neighbors-geometric-intuition-with-a-toy-example-1/
#-------------------------------------
# find more about CalibratedClassifierCV here at http://scikit-learn.org/stable/modules/generated/sklearn.calibration.CalibratedClassifierCV.html
# ----------------------------
# default parameters
# sklearn.calibration.CalibratedClassifierCV(base_estimator=None, method='sigmoid', cv=3)
#
# some of the methods of CalibratedClassifierCV()
# fit(X, y[, sample_weight]) Fit the calibrated model
# get_params([deep]) Get parameters for this estimator.
# predict(X) Predict the target of new samples.
# predict_proba(X) Posterior probabilities of classification
#-------------------------------------
if not os.path.exists('models/uni_asm_knn.sav'):
    # candidate values of k: odd numbers from 1 to 19
    alpha = [x for x in range(1, 21, 2)]
    cv_log_error_array = []
    for i in alpha:
        k_cfl = KNeighborsClassifier(n_neighbors=i)
        k_cfl.fit(X_train_asm, y_train_asm)
        sig_clf = CalibratedClassifierCV(k_cfl, method="sigmoid")
        sig_clf.fit(X_train_asm, y_train_asm)
        predict_y = sig_clf.predict_proba(X_cv_asm)
        cv_log_error_array.append(log_loss(y_cv_asm, predict_y, labels=k_cfl.classes_, eps=1e-15))
    for i in range(len(cv_log_error_array)):
        print('log_loss for k = ', alpha[i], 'is', cv_log_error_array[i])
    # index of the k with the lowest cross-validation log loss
    best_alpha = np.argmin(cv_log_error_array)
    fig, ax = plt.subplots()
    ax.plot(alpha, cv_log_error_array, c='g')
    for i, txt in enumerate(np.round(cv_log_error_array, 3)):
        ax.annotate((alpha[i], np.round(txt, 3)), (alpha[i], cv_log_error_array[i]))
    plt.grid()
    plt.title("Cross Validation Error for each alpha")
    plt.xlabel("Alpha i's")
    plt.ylabel("Error measure")
    plt.show()
    # refit with the best k and calibrate the predicted probabilities
    k_cfl = KNeighborsClassifier(n_neighbors=alpha[best_alpha])
    k_cfl.fit(X_train_asm, y_train_asm)
    sig_clf = CalibratedClassifierCV(k_cfl, method="sigmoid")
    sig_clf.fit(X_train_asm, y_train_asm)
    pred_y = sig_clf.predict(X_test_asm)
    # save the model to disk
    with open('models/uni_asm_knn.sav', 'wb') as model_file:
        pickle.dump(sig_clf, model_file)
else:
    # load the model from disk
    with open('models/uni_asm_knn.sav', 'rb') as model_file:
        sig_clf = pickle.load(model_file)
    print(sig_clf)
# report the log loss on all three splits
predict_y = sig_clf.predict_proba(X_train_asm)
print('log loss for train data', log_loss(y_train_asm, predict_y))
predict_y = sig_clf.predict_proba(X_cv_asm)
print('log loss for cv data', log_loss(y_cv_asm, predict_y))
predict_y = sig_clf.predict_proba(X_test_asm)
print('log loss for test data', log_loss(y_test_asm, predict_y))
plot_confusion_matrix(y_test_asm, sig_clf.predict(X_test_asm))
CalibratedClassifierCV(base_estimator=KNeighborsClassifier(n_neighbors=1))
log loss for train data 0.045166089148031885
log loss for cv data 0.05145287366236347
log loss for test data 0.04941708388670314
Number of misclassified points 0.5059797608095675
-------------------------------------------------- Confusion matrix --------------------------------------------------
-------------------------------------------------- Precision matrix --------------------------------------------------
Sum of columns in precision matrix [1. 1. 1. 1. 1. 1. 1. 1. 1.]
-------------------------------------------------- Recall matrix --------------------------------------------------
Sum of rows in precision matrix [1. 1. 1. 1. 1. 1. 1. 1. 1.]
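To put the log-loss values printed above in context, the following is a minimal standard-library sketch of the multiclass log loss. It is a simplified stand-in for `sklearn.metrics.log_loss`: it clips probabilities with `eps` but does not renormalise rows. The function name and the toy inputs are illustrative assumptions.

```python
import math


def multiclass_log_loss(y_true, probs, labels, eps=1e-15):
    """Average negative log-probability assigned to the true class of each sample."""
    column_of = {label: j for j, label in enumerate(labels)}
    total = 0.0
    for label, row in zip(y_true, probs):
        # clip so that log(0) can never occur
        p = min(max(row[column_of[label]], eps), 1 - eps)
        total -= math.log(p)
    return total / len(y_true)


labels = list(range(1, 10))  # the nine malware classes
# a model that knows nothing assigns probability 1/9 to every class
uniform_probs = [[1 / 9] * 9 for _ in range(3)]
baseline = multiclass_log_loss([1, 5, 9], uniform_probs, labels)
print("uniform-guess log loss:", baseline)
```

With nine classes a uniform guess scores ln(9), about 2.197, which gives a rough yardstick for the losses reported in this notebook.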
# The cell below uses LogisticRegression rather than SGDClassifier; read more at
# http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html
# ------------------------------
# parameters set in this cell
# LogisticRegression(penalty='l2', C=<value being tuned>, class_weight='balanced')
# C is the inverse of the regularization strength: smaller values mean stronger regularization
# some of methods
# fit(X, y[, sample_weight]) Fit the model according to the given training data.
# predict(X) Predict class labels for samples in X.
# predict_proba(X) Probability estimates.
#-------------------------------
# video link: https://www.appliedaicourse.com/course/applied-ai-course-online/lessons/geometric-intuition-1/
#------------------------------
if not os.path.exists('models/uni_asm_lr.sav'):
    # candidate values of C: 1e-5 ... 1e3
    alpha = [10 ** x for x in range(-5, 4)]
    cv_log_error_array = []
    for i in alpha:
        logisticR = LogisticRegression(penalty='l2', C=i, class_weight='balanced')
        logisticR.fit(X_train_asm, y_train_asm)
        sig_clf = CalibratedClassifierCV(logisticR, method="sigmoid")
        sig_clf.fit(X_train_asm, y_train_asm)
        predict_y = sig_clf.predict_proba(X_cv_asm)
        cv_log_error_array.append(log_loss(y_cv_asm, predict_y, labels=logisticR.classes_, eps=1e-15))
    for i in range(len(cv_log_error_array)):
        print('log_loss for c = ', alpha[i], 'is', cv_log_error_array[i])
    # index of the C with the lowest cross-validation log loss
    best_alpha = np.argmin(cv_log_error_array)
    fig, ax = plt.subplots()
    ax.plot(alpha, cv_log_error_array, c='g')
    for i, txt in enumerate(np.round(cv_log_error_array, 3)):
        ax.annotate((alpha[i], np.round(txt, 3)), (alpha[i], cv_log_error_array[i]))
    plt.grid()
    plt.title("Cross Validation Error for each alpha")
    plt.xlabel("Alpha i's")
    plt.ylabel("Error measure")
    plt.show()
    # refit with the best C and calibrate the predicted probabilities
    logisticR = LogisticRegression(penalty='l2', C=alpha[best_alpha], class_weight='balanced')
    logisticR.fit(X_train_asm, y_train_asm)
    sig_clf = CalibratedClassifierCV(logisticR, method="sigmoid")
    sig_clf.fit(X_train_asm, y_train_asm)
    # save the model to disk
    with open('models/uni_asm_lr.sav', 'wb') as model_file:
        pickle.dump(sig_clf, model_file)
else:
    # load the model from disk
    with open('models/uni_asm_lr.sav', 'rb') as model_file:
        sig_clf = pickle.load(model_file)
    print(sig_clf)
# report the log loss on all three splits
predict_y = sig_clf.predict_proba(X_train_asm)
print('log loss for train data', (log_loss(y_train_asm, predict_y, labels=sig_clf.base_estimator.classes_, eps=1e-15)))
predict_y = sig_clf.predict_proba(X_cv_asm)
print('log loss for cv data', (log_loss(y_cv_asm, predict_y, labels=sig_clf.base_estimator.classes_, eps=1e-15)))
predict_y = sig_clf.predict_proba(X_test_asm)
print('log loss for test data', (log_loss(y_test_asm, predict_y, labels=sig_clf.base_estimator.classes_, eps=1e-15)))
plot_confusion_matrix(y_test_asm, sig_clf.predict(X_test_asm))
CalibratedClassifierCV(base_estimator=LogisticRegression(C=0.1,
class_weight='balanced'))
log loss for train data 1.0089122184149941
log loss for cv data 0.9949338226436782
log loss for test data 1.015909628677529
Number of misclassified points 29.20883164673413
-------------------------------------------------- Confusion matrix --------------------------------------------------
-------------------------------------------------- Precision matrix --------------------------------------------------
Sum of columns in precision matrix [ 1. 1. 1. 1. nan 1. 1. 1. 1.]
-------------------------------------------------- Recall matrix --------------------------------------------------
Sum of rows in precision matrix [1. 1. 1. 1. 1. 1. 1. 1. 1.]
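The `nan` in the precision column sums above is what a column-normalised confusion matrix produces when a class is never predicted: the column total is zero, so every entry in that column is 0/0. The sketch below illustrates this with a toy 3-class matrix. It assumes the precision matrix is the confusion matrix divided by its column totals (the `plot_confusion_matrix` helper is not shown in this section), and the function name is hypothetical.

```python
import math


def precision_column_sums(confusion):
    """Sum each column of the column-normalised confusion matrix (rows = true, columns = predicted)."""
    n = len(confusion)
    sums = []
    for j in range(n):
        col_total = sum(confusion[i][j] for i in range(n))
        if col_total == 0:
            # class j was never predicted: 0/0 is undefined
            sums.append(float("nan"))
        else:
            sums.append(sum(confusion[i][j] / col_total for i in range(n)))
    return sums


# toy example: the middle class is never predicted
toy_confusion = [
    [5, 0, 1],
    [2, 0, 3],
    [0, 0, 4],
]
column_sums = precision_column_sums(toy_confusion)
print(column_sums)
```

Read this way, the output above suggests that the calibrated logistic regression predicted no test point as the fifth class.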
# --------------------------------
# default parameters
# sklearn.ensemble.RandomForestClassifier(n_estimators=10, criterion='gini', max_depth=None, min_samples_split=2,
# min_samples_leaf=1, min_weight_fraction_leaf=0.0, max_features='auto', max_leaf_nodes=None, min_impurity_decrease=0.0,
# min_impurity_split=None, bootstrap=True, oob_score=False, n_jobs=1, random_state=None, verbose=0, warm_start=False,
# class_weight=None)
# Some of methods of RandomForestClassifier()
# fit(X, y, [sample_weight]) Build a forest of trees from the training set (X, y).
# predict(X) Predict class for X.
# predict_proba(X) Predict class probabilities for X.
# some of attributes of RandomForestClassifier()
# feature_importances_ : array of shape = [n_features]
# The feature importances (the higher, the more important the feature).
# --------------------------------
# video link: https://www.appliedaicourse.com/course/applied-ai-course-online/lessons/random-forest-and-their-construction-2/
# --------------------------------
if not os.path.exists('models/uni_asm_rf.sav'):
    # candidate numbers of trees
    alpha = [10, 50, 100, 500, 1000, 2000, 3000]
    cv_log_error_array = []
    for i in alpha:
        r_cfl = RandomForestClassifier(n_estimators=i, random_state=42, n_jobs=-1)
        r_cfl.fit(X_train_asm, y_train_asm)
        sig_clf = CalibratedClassifierCV(r_cfl, method="sigmoid")
        sig_clf.fit(X_train_asm, y_train_asm)
        predict_y = sig_clf.predict_proba(X_cv_asm)
        cv_log_error_array.append(log_loss(y_cv_asm, predict_y, labels=r_cfl.classes_, eps=1e-15))
    for i in range(len(cv_log_error_array)):
        print('log_loss for c = ', alpha[i], 'is', cv_log_error_array[i])
    # index of the n_estimators value with the lowest cross-validation log loss
    best_alpha = np.argmin(cv_log_error_array)
    fig, ax = plt.subplots()
    ax.plot(alpha, cv_log_error_array, c='g')
    for i, txt in enumerate(np.round(cv_log_error_array, 3)):
        ax.annotate((alpha[i], np.round(txt, 3)), (alpha[i], cv_log_error_array[i]))
    plt.grid()
    plt.title("Cross Validation Error for each alpha")
    plt.xlabel("Alpha i's")
    plt.ylabel("Error measure")
    plt.show()
    # refit with the best number of trees and calibrate the predicted probabilities
    r_cfl = RandomForestClassifier(n_estimators=alpha[best_alpha], random_state=42, n_jobs=-1)
    r_cfl.fit(X_train_asm, y_train_asm)
    sig_clf = CalibratedClassifierCV(r_cfl, method="sigmoid")
    sig_clf.fit(X_train_asm, y_train_asm)
    predict_y = sig_clf.predict_proba(X_train_asm)
    # save the model to disk
    with open('models/uni_asm_rf.sav', 'wb') as model_file:
        pickle.dump(sig_clf, model_file)
else:
    # load the model from disk
    with open('models/uni_asm_rf.sav', 'rb') as model_file:
        sig_clf = pickle.load(model_file)
    print(sig_clf)
# report the log loss on all three splits
predict_y = sig_clf.predict_proba(X_train_asm)
print('log loss for train data', (log_loss(y_train_asm, predict_y, labels=sig_clf.base_estimator.classes_, eps=1e-15)))
predict_y = sig_clf.predict_proba(X_cv_asm)
print('log loss for cv data', (log_loss(y_cv_asm, predict_y, labels=sig_clf.base_estimator.classes_, eps=1e-15)))
predict_y = sig_clf.predict_proba(X_test_asm)
print('log loss for test data', (log_loss(y_test_asm, predict_y, labels=sig_clf.base_estimator.classes_, eps=1e-15)))
plot_confusion_matrix(y_test_asm, sig_clf.predict(X_test_asm))
CalibratedClassifierCV(base_estimator=RandomForestClassifier(n_estimators=500,
n_jobs=-1,
random_state=42))
log loss for train data 0.021391205046933592
log loss for cv data 0.022270653435918275
log loss for test data 0.02068634283445902
Number of misclassified points 0.22999080036798528
-------------------------------------------------- Confusion matrix --------------------------------------------------
-------------------------------------------------- Precision matrix --------------------------------------------------
Sum of columns in precision matrix [1. 1. 1. 1. 1. 1. 1. 1. 1.]
-------------------------------------------------- Recall matrix --------------------------------------------------
Sum of rows in precision matrix [1. 1. 1. 1. 1. 1. 1. 1. 1.]
# Training a hyper-parameter tuned XGBoost classifier on our train data
# find more about XGBClassifier function here http://xgboost.readthedocs.io/en/latest/python/python_api.html?#xgboost.XGBClassifier
# -------------------------
# default parameters
# class xgboost.XGBClassifier(max_depth=3, learning_rate=0.1, n_estimators=100, silent=True,
# objective='binary:logistic', booster='gbtree', n_jobs=1, nthread=None, gamma=0, min_child_weight=1,
# max_delta_step=0, subsample=1, colsample_bytree=1, colsample_bylevel=1, reg_alpha=0, reg_lambda=1,
# scale_pos_weight=1, base_score=0.5, random_state=0, seed=None, missing=None, **kwargs)
# some of methods of XGBClassifier()
# fit(X, y, sample_weight=None, eval_set=None, eval_metric=None, early_stopping_rounds=None, verbose=True, xgb_model=None)
# get_params([deep]) Get parameters for this estimator.
# predict(data, output_margin=False, ntree_limit=0) : Predict with data. NOTE: This function is not thread safe.
# get_score(importance_type='weight') -> get the feature importance
# -----------------------
# video link2: https://www.appliedaicourse.com/course/applied-ai-course-online/lessons/what-are-ensembles/
# -----------------------
if not os.path.exists('models/uni_asm_xgb.sav'):
    # candidate numbers of boosting rounds
    alpha = [10, 50, 100, 500, 1000, 2000, 3000]
    cv_log_error_array = []
    for i in alpha:
        x_cfl = XGBClassifier(n_estimators=i, nthread=-1)
        x_cfl.fit(X_train_asm, y_train_asm)
        sig_clf = CalibratedClassifierCV(x_cfl, method="sigmoid")
        sig_clf.fit(X_train_asm, y_train_asm)
        predict_y = sig_clf.predict_proba(X_cv_asm)
        cv_log_error_array.append(log_loss(y_cv_asm, predict_y, labels=x_cfl.classes_, eps=1e-15))
    for i in range(len(cv_log_error_array)):
        print('log_loss for c = ', alpha[i], 'is', cv_log_error_array[i])
    # index of the n_estimators value with the lowest cross-validation log loss
    best_alpha = np.argmin(cv_log_error_array)
    fig, ax = plt.subplots()
    ax.plot(alpha, cv_log_error_array, c='g')
    for i, txt in enumerate(np.round(cv_log_error_array, 3)):
        ax.annotate((alpha[i], np.round(txt, 3)), (alpha[i], cv_log_error_array[i]))
    plt.grid()
    plt.title("Cross Validation Error for each alpha")
    plt.xlabel("Alpha i's")
    plt.ylabel("Error measure")
    plt.show()
    # refit with the best n_estimators and calibrate the predicted probabilities
    x_cfl = XGBClassifier(n_estimators=alpha[best_alpha], nthread=-1)
    x_cfl.fit(X_train_asm, y_train_asm)
    sig_clf = CalibratedClassifierCV(x_cfl, method="sigmoid")
    sig_clf.fit(X_train_asm, y_train_asm)
    # save the model to disk
    with open('models/uni_asm_xgb.sav', 'wb') as model_file:
        pickle.dump(sig_clf, model_file)
else:
    # load the model from disk
    with open('models/uni_asm_xgb.sav', 'rb') as model_file:
        sig_clf = pickle.load(model_file)
    print(sig_clf)
# report the log loss on all three splits
predict_y = sig_clf.predict_proba(X_train_asm)
print('For values of best alpha = ', sig_clf.base_estimator.n_estimators, "The train log loss is:", log_loss(y_train_asm, predict_y))
predict_y = sig_clf.predict_proba(X_cv_asm)
print('For values of best alpha = ', sig_clf.base_estimator.n_estimators, "The cross validation log loss is:", log_loss(y_cv_asm, predict_y))
predict_y = sig_clf.predict_proba(X_test_asm)
print('For values of best alpha = ', sig_clf.base_estimator.n_estimators, "The test log loss is:", log_loss(y_test_asm, predict_y))
plot_confusion_matrix(y_test_asm, sig_clf.predict(X_test_asm))
CalibratedClassifierCV(base_estimator=XGBClassifier(base_score=0.5,
booster='gbtree',
colsample_bylevel=1,
colsample_bynode=1,
colsample_bytree=1,
enable_categorical=False,
gamma=0, gpu_id=-1,
importance_type=None,
interaction_constraints='',
learning_rate=0.300000012,
max_delta_step=0,
max_depth=6,
min_child_weight=1,
missing=nan,
monotone_constraints='()',
n_estimators=3000,
n_jobs=16, nthread=-1,
num_parallel_tree=1,
objective='multi:softprob',
predictor='auto',
random_state=0, reg_alpha=0,
reg_lambda=1,
scale_pos_weight=None,
subsample=1,
tree_method='exact',
validate_parameters=1,
verbosity=None))
For values of best alpha = 3000 The train log loss is: 0.017821475784672216
For values of best alpha = 3000 The cross validation log loss is: 0.016268589645363533
For values of best alpha = 3000 The test log loss is: 0.016268506240911897
Number of misclassified points 0.09199632014719411
-------------------------------------------------- Confusion matrix --------------------------------------------------
-------------------------------------------------- Precision matrix --------------------------------------------------
Sum of columns in precision matrix [1. 1. 1. 1. 1. 1. 1. 1. 1.]
-------------------------------------------------- Recall matrix --------------------------------------------------
Sum of rows in precision matrix [1. 1. 1. 1. 1. 1. 1. 1. 1.]
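The "Number of misclassified points" values printed for the four asm models are consistent with percentages of the test set rather than raw counts. The sketch below converts them back to counts under two assumptions that this section does not state explicitly: that the value is a percentage, and that the test split holds 2,174 files (20% of 10,868).

```python
def misclassified_count(percent, n_test):
    """Convert a misclassification percentage back to a whole number of test points."""
    return round(percent / 100.0 * n_test)


n_test = 2174  # assumed test-set size, not stated in this section

# percentages as printed in the outputs above
reported_percent = {
    "knn": 0.5059797608095675,
    "logistic_regression": 29.20883164673413,
    "random_forest": 0.22999080036798528,
    "xgboost": 0.09199632014719411,
}

counts = {name: misclassified_count(p, n_test) for name, p in reported_percent.items()}
for name, n_wrong in counts.items():
    print(f"{name}: about {n_wrong} of {n_test} test files misclassified")
```

Under these assumptions every reported value maps back to a whole number of files, which supports reading it as a percentage.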
if False:
    # randomized hyper-parameter search for XGBoost (disabled with `if False`)
    x_cfl = XGBClassifier()
    prams = {
        'learning_rate': [0.01, 0.03, 0.05, 0.1, 0.15, 0.2],
        'n_estimators': [100, 200, 500, 1000, 2000],
        'max_depth': [3, 5, 10],
        'colsample_bytree': [0.1, 0.3, 0.5, 1],
        'subsample': [0.1, 0.3, 0.5, 1]
    }
    random_cfl = RandomizedSearchCV(x_cfl, param_distributions=prams, verbose=10, n_jobs=-1)
    random_cfl.fit(X_train_asm, y_train_asm)
if False:
    print(random_cfl.best_params_)
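The search space defined above is far larger than what a randomized search actually evaluates. The sketch below enumerates the grid with the standard library and draws 10 combinations from it, mirroring the default `n_iter=10` of `RandomizedSearchCV`, which samples without replacement when every parameter is given as a list. It illustrates the sampling idea only and is not scikit-learn's implementation; the seed and variable names are arbitrary.

```python
import itertools
import random

param_grid = {
    "learning_rate": [0.01, 0.03, 0.05, 0.1, 0.15, 0.2],
    "n_estimators": [100, 200, 500, 1000, 2000],
    "max_depth": [3, 5, 10],
    "colsample_bytree": [0.1, 0.3, 0.5, 1],
    "subsample": [0.1, 0.3, 0.5, 1],
}

names = list(param_grid)
# every possible combination of the listed values
all_combinations = list(itertools.product(*(param_grid[name] for name in names)))

rng = random.Random(0)  # fixed seed so the sketch is repeatable
sampled = [dict(zip(names, combo)) for combo in rng.sample(all_combinations, 10)]

print("grid size:", len(all_combinations))
print("first sampled candidate:", sampled[0])
```

With only 10 of 1,440 combinations tried per run, the parameters found this way are a reasonable starting point rather than a guaranteed optimum.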
# Training a hyper-parameter tuned XGBoost classifier on our train data
x_cfl=XGBClassifier(n_estimators=2000,subsample=0.5,learning_rate=0.05,colsample_bytree=0.3,max_depth=5)
x_cfl.fit(X_train_asm,y_train_asm)
c_cfl=CalibratedClassifierCV(x_cfl,method='sigmoid')
c_cfl.fit(X_train_asm,y_train_asm)
predict_y = c_cfl.predict_proba(X_train_asm)
print ('train loss',log_loss(y_train_asm, predict_y))
predict_y = c_cfl.predict_proba(X_cv_asm)
print ('cv loss',log_loss(y_cv_asm, predict_y))
predict_y = c_cfl.predict_proba(X_test_asm)
print ('test loss',log_loss(y_test_asm, predict_y))
[06:09:51] WARNING: ..\src\learner.cc:1115: Starting in XGBoost 1.3.0, the default evaluation metric used with the objective 'multi:softprob' was changed from 'merror' to 'mlogloss'. Explicitly set eval_metric if you'd like to restore the old behavior.
[06:10:06] WARNING: ..\src\learner.cc:1115: Starting in XGBoost 1.3.0, the default evaluation metric used with the objective 'multi:softprob' was changed from 'merror' to 'mlogloss'. Explicitly set eval_metric if you'd like to restore the old behavior.
[06:10:18] WARNING: ..\src\learner.cc:1115: Starting in XGBoost 1.3.0, the default evaluation metric used with the objective 'multi:softprob' was changed from 'merror' to 'mlogloss'. Explicitly set eval_metric if you'd like to restore the old behavior.
[06:10:30] WARNING: ..\src\learner.cc:1115: Starting in XGBoost 1.3.0, the default evaluation metric used with the objective 'multi:softprob' was changed from 'merror' to 'mlogloss'. Explicitly set eval_metric if you'd like to restore the old behavior.
[06:10:47] WARNING: ..\src\learner.cc:1115: Starting in XGBoost 1.3.0, the default evaluation metric used with the objective 'multi:softprob' was changed from 'merror' to 'mlogloss'. Explicitly set eval_metric if you'd like to restore the old behavior.
[06:11:01] WARNING: ..\src\learner.cc:1115: Starting in XGBoost 1.3.0, the default evaluation metric used with the objective 'multi:softprob' was changed from 'merror' to 'mlogloss'. Explicitly set eval_metric if you'd like to restore the old behavior.
train loss 0.013169849274296428
cv loss 0.025238055717256534
test loss 0.037825210778105696
import numpy as np
import pandas as pd
import codecs
import imageio
import array
from datetime import datetime
import os
from tqdm.notebook import tqdm
if not os.path.exists('pixel_asm.csv'):
    # one row per .asm file: the file ID followed by its first 800 bytes
    cols = ['ID']
    cols.extend([i for i in range(800)])
    asmfile_list = os.listdir("asmFiles")
    rows = []
    for file_name in tqdm(asmfile_list):
        path = "asmFiles/" + file_name
        size_of_current_asm_file = os.path.getsize(path)
        array_of_image = array.array('B')
        # read the raw bytes of the file as unsigned 8-bit integers
        with open(path, 'rb') as file:
            array_of_image.fromfile(file, size_of_current_asm_file)
        # keep the first 800 bytes (this assumes every file has at least 800 bytes)
        arr_of_generated_image = np.reshape(array_of_image[:800], 800)
        arr_of_generated_image = np.uint8(arr_of_generated_image)
        temp = [file_name.split('.')[0]]
        temp.extend(list(arr_of_generated_image))
        rows.append(temp)
    # build the frame once instead of appending row by row
    pixel_feat = pd.DataFrame(rows, columns=cols)
    pixel_feat.to_csv('pixel_asm.csv', index=False)
# always load from disk so that `pixel_asm` is defined on the first run as well
pixel_asm = pd.read_csv('pixel_asm.csv')
display(pixel_asm.head())
| | ID | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | ... | 790 | 791 | 792 | 793 | 794 | 795 | 796 | 797 | 798 | 799 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 01azqd4InC7m9JpocGv5 | 72 | 69 | 65 | 68 | 69 | 82 | 58 | 48 | 48 | ... | 61 | 61 | 61 | 61 | 61 | 61 | 61 | 61 | 61 | 61 |
| 1 | 01IsoiSMh5gxyDYTl4CB | 46 | 116 | 101 | 120 | 116 | 58 | 48 | 48 | 52 | ... | 56 | 54 | 32 | 40 | 80 | 69 | 41 | 13 | 10 | 46 |
| 2 | 01jsnpXSAlgw6aPeDxrU | 72 | 69 | 65 | 68 | 69 | 82 | 58 | 48 | 48 | ... | 61 | 61 | 61 | 61 | 61 | 61 | 61 | 61 | 61 | 61 |
| 3 | 01kcPWA9K2BOxQeS5Rju | 72 | 69 | 65 | 68 | 69 | 82 | 58 | 49 | 48 | ... | 109 | 111 | 100 | 101 | 108 | 32 | 102 | 108 | 97 | 116 |
| 4 | 01SuzwMJEIXsK7A8dQbl | 72 | 69 | 65 | 68 | 69 | 82 | 58 | 48 | 48 | ... | 61 | 61 | 61 | 61 | 61 | 61 | 61 | 61 | 61 | 61 |
5 rows × 801 columns
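The "pixel" features in the table above are simply the first 800 byte values of each .asm file, so the leading columns spell out the file's opening text. The sketch below decodes the first few values of two rows and shows one hedged way to read a fixed-length byte prefix. The helper names are hypothetical, and the zero-padding for short files is an assumption that the original cell does not make.

```python
def first_n_bytes(path, n=800):
    """Return the first n byte values of a file, zero-padded if the file is shorter (assumption)."""
    with open(path, "rb") as handle:
        data = handle.read(n)
    return list(data) + [0] * (n - len(data))


def decode_prefix(byte_values):
    """Turn a list of byte values back into ASCII text."""
    return bytes(byte_values).decode("ascii")


# leading values of rows 0 and 1 in the table above
header_prefix = decode_prefix([72, 69, 65, 68, 69, 82, 58])
text_prefix = decode_prefix([46, 116, 101, 120, 116, 58])
print(header_prefix, text_prefix)
```

The decoded prefixes are the section names that open each disassembly listing, so many files share identical leading columns.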
asm_size_byte = pd.read_csv('asm_size_byte.csv')
result_pixel_asm = pixel_asm.merge(asm_size_byte, on='ID', how='inner')
# normalize each feature column
result_pixel_asm = normalize(result_pixel_asm)
result_pixel_asm.head()
| | ID | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | ... | 792 | 793 | 794 | 795 | 796 | 797 | 798 | 799 | size | Class |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 01azqd4InC7m9JpocGv5 | 0.481928 | 0.302632 | 0.000000 | 0.277778 | 0.291667 | 0.500000 | 1.0 | 0.000000 | 0.000000 | ... | 0.481481 | 0.490566 | 0.525253 | 0.490566 | 0.490566 | 0.485981 | 0.460177 | 0.485981 | 0.400910 | 9 |
| 1 | 01IsoiSMh5gxyDYTl4CB | 0.168675 | 0.921053 | 0.705882 | 1.000000 | 0.944444 | 0.147059 | 0.0 | 0.000000 | 0.181818 | ... | 0.212963 | 0.292453 | 0.717172 | 0.566038 | 0.301887 | 0.037383 | 0.008850 | 0.345794 | 0.099719 | 2 |
| 2 | 01jsnpXSAlgw6aPeDxrU | 0.481928 | 0.302632 | 0.000000 | 0.277778 | 0.291667 | 0.500000 | 1.0 | 0.000000 | 0.000000 | ... | 0.481481 | 0.490566 | 0.525253 | 0.490566 | 0.490566 | 0.485981 | 0.460177 | 0.485981 | 0.060553 | 9 |
| 3 | 01kcPWA9K2BOxQeS5Rju | 0.481928 | 0.302632 | 0.000000 | 0.277778 | 0.291667 | 0.500000 | 1.0 | 0.058824 | 0.000000 | ... | 0.842593 | 0.867925 | 1.000000 | 0.216981 | 0.877358 | 0.925234 | 0.778761 | 1.000000 | 0.000432 | 1 |
| 4 | 01SuzwMJEIXsK7A8dQbl | 0.481928 | 0.302632 | 0.000000 | 0.277778 | 0.291667 | 0.500000 | 1.0 | 0.000000 | 0.000000 | ... | 0.481481 | 0.490566 | 0.525253 | 0.490566 | 0.490566 | 0.485981 | 0.460177 | 0.485981 | 0.006983 | 8 |
5 rows × 803 columns
plt.close()
# check out the course content for more explanation of the t-SNE algorithm
# https://www.appliedaicourse.com/course/applied-ai-course-online/lessons/t-distributed-stochastic
# multivariate analysis on the asm pixel features
# this is with perplexity 50
xtsne=TSNE(perplexity=50)
results=xtsne.fit_transform(result_pixel_asm.drop(['ID','Class'], axis=1).fillna(0))
vis_x = results[:, 0]
vis_y = results[:, 1 ]
plt.scatter(vis_x, vis_y, c=data_y, cmap=plt.cm.get_cmap("jet", 9))
plt.colorbar(ticks=range(10))
plt.clim(0.5, 9)
plt.show()
#this is with perplexity 30
xtsne=TSNE(perplexity=30)
results=xtsne.fit_transform(result_pixel_asm.drop(['ID','Class'], axis=1).fillna(0))
vis_x = results[:, 0]
vis_y = results[:, 1 ]
plt.scatter(vis_x, vis_y, c=data_y, cmap=plt.cm.get_cmap("jet", 9))
plt.colorbar(ticks=range(10))
plt.clim(0.5, 9)
plt.show()
a = result_pixel_asm.isnull().all()
a[a==True]
18    True
19    True
dtype: bool
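A likely reason columns 18 and 19 are entirely null is that they hold the same byte value in every file, so a min-max style normalisation divides zero by zero. The sketch below demonstrates that effect with the standard library. It assumes the notebook's `normalize` helper (not shown in this section) scales each column as (x - min) / (max - min), and the function name here is hypothetical.

```python
import math


def min_max_column(values):
    """Scale a column to [0, 1]; a constant column yields NaN, as 0/0 does in pandas."""
    lowest, highest = min(values), max(values)
    span = highest - lowest
    if span == 0:
        return [float("nan")] * len(values)
    return [(value - lowest) / span for value in values]


varying = min_max_column([48, 49, 52])   # differs between files
constant = min_max_column([32, 32, 32])  # identical in every file
print(varying, constant)
```

Dropping such columns, as the next cell does, loses no information because a constant feature cannot help separate the classes.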
pixel_asm_y = result_pixel_asm['Class']
pixel_asm_x = result_pixel_asm.drop(['ID','Class','18','19'], axis=1)
X_train_pixel_asm, X_test_pixel_asm, y_train_pixel_asm, y_test_pixel_asm = train_test_split(pixel_asm_x,pixel_asm_y ,stratify=pixel_asm_y,test_size=0.20)
X_train_pixel_asm, X_cv_pixel_asm, y_train_pixel_asm, y_cv_pixel_asm = train_test_split(X_train_pixel_asm, y_train_pixel_asm,stratify=y_train_pixel_asm,test_size=0.20)
if not os.path.exists('models/uni_pixel_asm_knn.sav'):
    # candidate values of k: odd numbers from 1 to 19
    alpha = [x for x in range(1, 21, 2)]
    cv_log_error_array = []
    for i in alpha:
        k_cfl = KNeighborsClassifier(n_neighbors=i)
        k_cfl.fit(X_train_pixel_asm, y_train_pixel_asm)
        sig_clf = CalibratedClassifierCV(k_cfl, method="sigmoid")
        sig_clf.fit(X_train_pixel_asm, y_train_pixel_asm)
        predict_y = sig_clf.predict_proba(X_cv_pixel_asm)
        cv_log_error_array.append(log_loss(y_cv_pixel_asm, predict_y, labels=k_cfl.classes_, eps=1e-15))
    for i in range(len(cv_log_error_array)):
        print('log_loss for k = ', alpha[i], 'is', cv_log_error_array[i])
    # index of the k with the lowest cross-validation log loss
    best_alpha = np.argmin(cv_log_error_array)
    fig, ax = plt.subplots()
    ax.plot(alpha, cv_log_error_array, c='g')
    for i, txt in enumerate(np.round(cv_log_error_array, 3)):
        ax.annotate((alpha[i], np.round(txt, 3)), (alpha[i], cv_log_error_array[i]))
    plt.grid()
    plt.title("Cross Validation Error for each alpha")
    plt.xlabel("Alpha i's")
    plt.ylabel("Error measure")
    plt.show()
    # refit with the best k and calibrate the predicted probabilities
    k_cfl = KNeighborsClassifier(n_neighbors=alpha[best_alpha])
    k_cfl.fit(X_train_pixel_asm, y_train_pixel_asm)
    sig_clf = CalibratedClassifierCV(k_cfl, method="sigmoid")
    sig_clf.fit(X_train_pixel_asm, y_train_pixel_asm)
    pred_y = sig_clf.predict(X_test_pixel_asm)
    # save the model to disk
    with open('models/uni_pixel_asm_knn.sav', 'wb') as model_file:
        pickle.dump(sig_clf, model_file)
else:
    # load the model from disk
    with open('models/uni_pixel_asm_knn.sav', 'rb') as model_file:
        sig_clf = pickle.load(model_file)
    print(sig_clf)
# report the log loss on all three splits
predict_y = sig_clf.predict_proba(X_train_pixel_asm)
print('log loss for train data', log_loss(y_train_pixel_asm, predict_y))
predict_y = sig_clf.predict_proba(X_cv_pixel_asm)
print('log loss for cv data', log_loss(y_cv_pixel_asm, predict_y))
predict_y = sig_clf.predict_proba(X_test_pixel_asm)
print('log loss for test data', log_loss(y_test_pixel_asm, predict_y))
plot_confusion_matrix(y_test_pixel_asm, sig_clf.predict(X_test_pixel_asm))
log_loss for k = 1 is 0.2646642314458625
log_loss for k = 3 is 0.1987265638632809
log_loss for k = 5 is 0.19411944843751341
log_loss for k = 7 is 0.19687720431972555
log_loss for k = 9 is 0.20523764596511845
log_loss for k = 11 is 0.21369094736735714
log_loss for k = 13 is 0.22200008524996714
log_loss for k = 15 is 0.22751663890577326
log_loss for k = 17 is 0.23228044946391072
log_loss for k = 19 is 0.23940685916635937
log loss for train data 0.16407312132298535
log loss for cv data 0.19411944843751341
log loss for test data 0.1982553517183401
Number of misclassified points 82.65869365225392
-------------------------------------------------- Confusion matrix --------------------------------------------------
-------------------------------------------------- Precision matrix --------------------------------------------------
Sum of columns in precision matrix [1. 1. 1. 1. 1. 1. 1. 1. 1.]
-------------------------------------------------- Recall matrix --------------------------------------------------
Sum of rows in precision matrix [1. 1. 1. 1. 1. 1. 1. 1. 1.]
if not os.path.exists('models/uni_pixel_asm_lr.sav'):
    # candidate values of C: 1e-5 ... 1e3
    alpha = [10 ** x for x in range(-5, 4)]
    cv_log_error_array = []
    for i in alpha:
        logisticR = LogisticRegression(penalty='l2', C=i, class_weight='balanced')
        logisticR.fit(X_train_pixel_asm, y_train_pixel_asm)
        sig_clf = CalibratedClassifierCV(logisticR, method="sigmoid")
        sig_clf.fit(X_train_pixel_asm, y_train_pixel_asm)
        predict_y = sig_clf.predict_proba(X_cv_pixel_asm)
        cv_log_error_array.append(log_loss(y_cv_pixel_asm, predict_y, labels=logisticR.classes_, eps=1e-15))
    for i in range(len(cv_log_error_array)):
        print('log_loss for c = ', alpha[i], 'is', cv_log_error_array[i])
    # index of the C with the lowest cross-validation log loss
    best_alpha = np.argmin(cv_log_error_array)
    fig, ax = plt.subplots()
    ax.plot(alpha, cv_log_error_array, c='g')
    for i, txt in enumerate(np.round(cv_log_error_array, 3)):
        ax.annotate((alpha[i], np.round(txt, 3)), (alpha[i], cv_log_error_array[i]))
    plt.grid()
    plt.title("Cross Validation Error for each alpha")
    plt.xlabel("Alpha i's")
    plt.ylabel("Error measure")
    plt.show()
    # refit with the best C and calibrate the predicted probabilities
    logisticR = LogisticRegression(penalty='l2', C=alpha[best_alpha], class_weight='balanced')
    logisticR.fit(X_train_pixel_asm, y_train_pixel_asm)
    sig_clf = CalibratedClassifierCV(logisticR, method="sigmoid")
    sig_clf.fit(X_train_pixel_asm, y_train_pixel_asm)
    # save the model to disk
    with open('models/uni_pixel_asm_lr.sav', 'wb') as model_file:
        pickle.dump(sig_clf, model_file)
else:
    # load the model from disk
    with open('models/uni_pixel_asm_lr.sav', 'rb') as model_file:
        sig_clf = pickle.load(model_file)
    print(sig_clf)
# report the log loss on all three splits
predict_y = sig_clf.predict_proba(X_train_pixel_asm)
print('log loss for train data', (log_loss(y_train_pixel_asm, predict_y, labels=sig_clf.base_estimator.classes_, eps=1e-15)))
predict_y = sig_clf.predict_proba(X_cv_pixel_asm)
print('log loss for cv data', (log_loss(y_cv_pixel_asm, predict_y, labels=sig_clf.base_estimator.classes_, eps=1e-15)))
predict_y = sig_clf.predict_proba(X_test_pixel_asm)
print('log loss for test data', (log_loss(y_test_pixel_asm, predict_y, labels=sig_clf.base_estimator.classes_, eps=1e-15)))
plot_confusion_matrix(y_test_pixel_asm, sig_clf.predict(X_test_pixel_asm))
log_loss for c = 1e-05 is 0.8787528387970329
log_loss for c = 0.0001 is 0.8802125047636423
log_loss for c = 0.001 is 0.8439799225298161
log_loss for c = 0.01 is 0.7439277050134949
log_loss for c = 0.1 is 0.7122805394468205
log_loss for c = 1 is 0.686402219854219
log_loss for c = 10 is 0.7301085138114816
log_loss for c = 100 is 0.7353080825692381
log_loss for c = 1000 is 0.7358772917320936
log loss for train data 0.6719300478640129
log loss for cv data 0.686402219854219
log loss for test data 0.673640428063436
Number of misclassified points 79.30082796688133
-------------------------------------------------- Confusion matrix --------------------------------------------------
-------------------------------------------------- Precision matrix --------------------------------------------------
Sum of columns in precision matrix [ 1. 1. 1. 1. nan nan nan 1. 1.]
-------------------------------------------------- Recall matrix --------------------------------------------------
Sum of rows in precision matrix [1. 1. 1. 1. 1. 1. 1. 1. 1.]
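The `nan` entries in the column sums of the precision matrix deserve a second look. The sketch below shows one way they can arise, assuming `plot_confusion_matrix` (defined elsewhere in this notebook) builds the precision matrix by dividing each column of the confusion matrix by its column total. A class that is never predicted has a column total of 0, so the division is 0/0. The 3-class counts and the helper `precision_matrix` are made up for illustration.

```python
import math


def precision_matrix(confusion):
    """Divide each column of a confusion matrix by its column total.

    Rows are true classes, columns are predicted classes. A column whose
    total is 0 (the class was never predicted) becomes NaN, mirroring the
    0/0 result NumPy gives for the same division.
    """
    n = len(confusion)
    col_totals = [sum(confusion[r][c] for r in range(n)) for c in range(n)]
    return [
        [confusion[r][c] / col_totals[c] if col_totals[c] else math.nan
         for c in range(n)]
        for r in range(n)
    ]


# Hypothetical counts: class 2 is never predicted, so its column is all zeros.
confusion = [
    [8, 2, 0],
    [1, 9, 0],
    [3, 2, 0],
]
precision = precision_matrix(confusion)
column_sums = [sum(row[c] for row in precision) for c in range(3)]
print(column_sums)  # the first two sums are 1 (up to rounding), the last is nan
```

Read this way, a `nan` in the column sums points at a class the calibrated model never predicts on the test split.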
# --------------------------------
# default parameters
# sklearn.ensemble.RandomForestClassifier(n_estimators=10, criterion='gini', max_depth=None, min_samples_split=2,
# min_samples_leaf=1, min_weight_fraction_leaf=0.0, max_features='auto', max_leaf_nodes=None, min_impurity_decrease=0.0,
# min_impurity_split=None, bootstrap=True, oob_score=False, n_jobs=1, random_state=None, verbose=0, warm_start=False,
# class_weight=None)
# Some of the methods of RandomForestClassifier()
# fit(X, y, [sample_weight]) Build a forest of trees from the training set (X, y).
# predict(X) Predict the class for each sample in X.
# predict_proba(X) Predict class probabilities for each sample in X.
# some of attributes of RandomForestClassifier()
# feature_importances_ : array of shape = [n_features]
# The feature importances (the higher, the more important the feature).
# --------------------------------
# video link: https://www.appliedaicourse.com/course/applied-ai-course-online/lessons/random-forest-and-their-construction-2/
# --------------------------------
if not os.path.exists('models/uni_pixel_asm_rf.sav'):
alpha=[10,50,100,500,1000,2000,3000]
cv_log_error_array=[]
for i in alpha:
r_cfl=RandomForestClassifier(n_estimators=i,random_state=42,n_jobs=-1)
r_cfl.fit(X_train_pixel_asm,y_train_pixel_asm)
sig_clf = CalibratedClassifierCV(r_cfl, method="sigmoid")
sig_clf.fit(X_train_pixel_asm, y_train_pixel_asm)
predict_y = sig_clf.predict_proba(X_cv_pixel_asm)
cv_log_error_array.append(log_loss(y_cv_pixel_asm, predict_y, labels=r_cfl.classes_, eps=1e-15))
for i in range(len(cv_log_error_array)):
print ('log_loss for c = ',alpha[i],'is',cv_log_error_array[i])
best_alpha = np.argmin(cv_log_error_array)
fig, ax = plt.subplots()
ax.plot(alpha, cv_log_error_array,c='g')
for i, txt in enumerate(np.round(cv_log_error_array,3)):
ax.annotate((alpha[i],np.round(txt,3)), (alpha[i],cv_log_error_array[i]))
plt.grid()
plt.title("Cross Validation Error for each alpha")
plt.xlabel("Alpha i's")
plt.ylabel("Error measure")
plt.show()
r_cfl=RandomForestClassifier(n_estimators=alpha[best_alpha],random_state=42,n_jobs=-1)
r_cfl.fit(X_train_pixel_asm,y_train_pixel_asm)
sig_clf = CalibratedClassifierCV(r_cfl, method="sigmoid")
sig_clf.fit(X_train_pixel_asm, y_train_pixel_asm)
predict_y = sig_clf.predict_proba(X_train_pixel_asm)
# save the model to disk
pickle.dump(sig_clf, open('models/uni_pixel_asm_rf.sav', 'wb'))
else:
# load the model from disk
sig_clf = pickle.load(open('models/uni_pixel_asm_rf.sav', 'rb'))
print(sig_clf)
predict_y = sig_clf.predict_proba(X_train_pixel_asm)
print ('log loss for train data',(log_loss(y_train_pixel_asm, predict_y, labels=sig_clf.base_estimator.classes_, eps=1e-15)))
predict_y = sig_clf.predict_proba(X_cv_pixel_asm)
print ('log loss for cv data',(log_loss(y_cv_pixel_asm, predict_y, labels=sig_clf.base_estimator.classes_, eps=1e-15)))
predict_y = sig_clf.predict_proba(X_test_pixel_asm)
print ('log loss for test data',(log_loss(y_test_pixel_asm, predict_y, labels=sig_clf.base_estimator.classes_, eps=1e-15)))
plot_confusion_matrix(y_test_pixel_asm,sig_clf.predict(X_test_pixel_asm))
log_loss for c = 10 is 0.21164924133748067
log_loss for c = 50 is 0.2088368866374018
log_loss for c = 100 is 0.20801897022617863
log_loss for c = 500 is 0.20759589699371997
log_loss for c = 1000 is 0.20773015594988398
log_loss for c = 2000 is 0.20770731155498942
log_loss for c = 3000 is 0.2076355141344381
log loss for train data 0.09164322513590176
log loss for cv data 0.20759589699371997
log loss for test data 0.21646396341588559
Number of misclassified points 82.42870285188593
-------------------------------------------------- Confusion matrix --------------------------------------------------
-------------------------------------------------- Precision matrix --------------------------------------------------
Sum of columns in precision matrix [1. 1. 1. 1. 1. 1. 1. 1. 1.]
-------------------------------------------------- Recall matrix --------------------------------------------------
Sum of rows in precision matrix [1. 1. 1. 1. 1. 1. 1. 1. 1.]
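For reference, the log-loss figures reported in these cells are the average negative log of the probability assigned to the true class. The sketch below computes that by hand in plain Python. `multiclass_log_loss` is a hypothetical helper, not part of scikit-learn; it assumes labels are 0-based column indices, and its clipping mimics the `eps=1e-15` argument passed above. It also shows that a model assigning 1/9 to each of the 9 classes scores ln 9, a handy baseline when reading the numbers above.

```python
import math


def multiclass_log_loss(y_true, probs, eps=1e-15):
    """Mean negative log-probability of the true class.

    y_true: list of 0-based class indices (an assumption of this sketch).
    probs:  list of per-sample probability rows, one entry per class.
    """
    total = 0.0
    for label, row in zip(y_true, probs):
        # Clip so that a probability of exactly 0 or 1 never reaches log()
        p = min(max(row[label], eps), 1 - eps)
        total -= math.log(p)
    return total / len(y_true)


# A model with no information: every one of the 9 classes gets probability 1/9.
uniform = [[1 / 9] * 9 for _ in range(4)]
baseline = multiclass_log_loss([0, 3, 5, 8], uniform)
print(round(baseline, 4))  # 2.1972

# A more confident, correct model scores much lower.
confident = multiclass_log_loss([0, 1], [[0.9, 0.1], [0.2, 0.8]])
print(confident < baseline)  # True
```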
# Training a hyper-parameter tuned XGBoost classifier on our train data
# find more about XGBClassifier function here http://xgboost.readthedocs.io/en/latest/python/python_api.html?#xgboost.XGBClassifier
# -------------------------
# default parameters
# class xgboost.XGBClassifier(max_depth=3, learning_rate=0.1, n_estimators=100, silent=True,
# objective='binary:logistic', booster='gbtree', n_jobs=1, nthread=None, gamma=0, min_child_weight=1,
# max_delta_step=0, subsample=1, colsample_bytree=1, colsample_bylevel=1, reg_alpha=0, reg_lambda=1,
# scale_pos_weight=1, base_score=0.5, random_state=0, seed=None, missing=None, **kwargs)
# some of the methods of XGBClassifier()
# fit(X, y, sample_weight=None, eval_set=None, eval_metric=None, early_stopping_rounds=None, verbose=True, xgb_model=None)
# get_params([deep]) Get parameters for this estimator.
# predict(data, output_margin=False, ntree_limit=0) : Predict with data. NOTE: This function is not thread safe.
# get_score(importance_type='weight') -> get the feature importance
# -----------------------
# video link2: https://www.appliedaicourse.com/course/applied-ai-course-online/lessons/what-are-ensembles/
# -----------------------
if not os.path.exists('models/uni_pixel_asm_xgb.sav'):
alpha=[10,50,100,500,1000,2000,3000]
cv_log_error_array=[]
for i in alpha:
x_cfl=XGBClassifier(n_estimators=i,nthread=-1)
x_cfl.fit(X_train_pixel_asm,y_train_pixel_asm)
sig_clf = CalibratedClassifierCV(x_cfl, method="sigmoid")
sig_clf.fit(X_train_pixel_asm, y_train_pixel_asm)
predict_y = sig_clf.predict_proba(X_cv_pixel_asm)
cv_log_error_array.append(log_loss(y_cv_pixel_asm, predict_y, labels=x_cfl.classes_, eps=1e-15))
for i in range(len(cv_log_error_array)):
print ('log_loss for c = ',alpha[i],'is',cv_log_error_array[i])
best_alpha = np.argmin(cv_log_error_array)
fig, ax = plt.subplots()
ax.plot(alpha, cv_log_error_array,c='g')
for i, txt in enumerate(np.round(cv_log_error_array,3)):
ax.annotate((alpha[i],np.round(txt,3)), (alpha[i],cv_log_error_array[i]))
plt.grid()
plt.title("Cross Validation Error for each alpha")
plt.xlabel("Alpha i's")
plt.ylabel("Error measure")
plt.show()
x_cfl=XGBClassifier(n_estimators=alpha[best_alpha],nthread=-1)
x_cfl.fit(X_train_pixel_asm,y_train_pixel_asm)
sig_clf = CalibratedClassifierCV(x_cfl, method="sigmoid")
sig_clf.fit(X_train_pixel_asm, y_train_pixel_asm)
# save the model to disk
pickle.dump(sig_clf, open('models/uni_pixel_asm_xgb.sav', 'wb'))
else:
# load the model from disk
sig_clf = pickle.load(open('models/uni_pixel_asm_xgb.sav', 'rb'))
print(sig_clf)
predict_y = sig_clf.predict_proba(X_train_pixel_asm)
print ('For values of best alpha = ', sig_clf.base_estimator.n_estimators, "The train log loss is:",log_loss(y_train_pixel_asm, predict_y))
predict_y = sig_clf.predict_proba(X_cv_pixel_asm)
print('For values of best alpha = ', sig_clf.base_estimator.n_estimators, "The cross validation log loss is:",log_loss(y_cv_pixel_asm, predict_y))
predict_y = sig_clf.predict_proba(X_test_pixel_asm)
print('For values of best alpha = ', sig_clf.base_estimator.n_estimators, "The test log loss is:",log_loss(y_test_pixel_asm, predict_y))
plot_confusion_matrix(y_test_pixel_asm,sig_clf.predict(X_test_pixel_asm))
CalibratedClassifierCV(base_estimator=XGBClassifier(base_score=0.5,
booster='gbtree',
colsample_bylevel=1,
colsample_bynode=1,
colsample_bytree=1,
enable_categorical=False,
gamma=0, gpu_id=-1,
importance_type=None,
interaction_constraints='',
learning_rate=0.300000012,
max_delta_step=0,
max_depth=6,
min_child_weight=1,
missing=nan,
monotone_constraints='()',
n_estimators=50, n_jobs=16,
nthread=-1,
num_parallel_tree=1,
objective='multi:softprob',
predictor='auto',
random_state=0, reg_alpha=0,
reg_lambda=1,
scale_pos_weight=None,
subsample=1,
tree_method='exact',
validate_parameters=1,
verbosity=None))
For values of best alpha = 50 The train log loss is: 0.113297062453512
For values of best alpha = 50 The cross validation log loss is: 0.17507064167560846
For values of best alpha = 50 The test log loss is: 0.19786176088082427
Number of misclassified points 4.783808647654094
-------------------------------------------------- Confusion matrix --------------------------------------------------
-------------------------------------------------- Precision matrix --------------------------------------------------
Sum of columns in precision matrix [1. 1. 1. 1. 1. 1. 1. 1. 1.]
-------------------------------------------------- Recall matrix --------------------------------------------------
Sum of rows in precision matrix [1. 1. 1. 1. 1. 1. 1. 1. 1.]
if False:
    x_cfl = XGBClassifier()
    prams = {
        'learning_rate': [0.01, 0.03, 0.05, 0.1, 0.15, 0.2],
        'n_estimators': [100, 200, 500, 1000, 2000],
        'max_depth': [3, 5, 10],
        'colsample_bytree': [0.1, 0.3, 0.5, 1],
        'subsample': [0.1, 0.3, 0.5, 1]
    }
    random_cfl = RandomizedSearchCV(x_cfl, param_distributions=prams, verbose=10, n_jobs=-1)
    random_cfl.fit(X_train_pixel_asm, y_train_pixel_asm)
if False:
    print(random_cfl.best_params_)
{'subsample': 0.3, 'n_estimators': 1000, 'max_depth': 3, 'learning_rate': 0.2, 'colsample_bytree': 0.1}
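As a rough illustration of what the randomized search above does with this grid, the sketch below draws a handful of random combinations from the same `prams` dictionary instead of trying every one. `sample_candidates` is a hypothetical helper written for this note; the real `RandomizedSearchCV` also cross-validates each candidate and, by default, tries 10 of them.

```python
import random

# Same grid as in the cell above
prams = {
    'learning_rate': [0.01, 0.03, 0.05, 0.1, 0.15, 0.2],
    'n_estimators': [100, 200, 500, 1000, 2000],
    'max_depth': [3, 5, 10],
    'colsample_bytree': [0.1, 0.3, 0.5, 1],
    'subsample': [0.1, 0.3, 0.5, 1],
}


def sample_candidates(grid, n_iter, seed=0):
    """Draw n_iter random parameter combinations from a dict of value lists."""
    rng = random.Random(seed)
    return [
        {name: rng.choice(values) for name, values in grid.items()}
        for _ in range(n_iter)
    ]


# Size of the full grid that an exhaustive search would have to fit
grid_size = 1
for values in prams.values():
    grid_size *= len(values)
print(grid_size)  # 1440

candidates = sample_candidates(prams, n_iter=10)
for candidate in candidates:
    print(candidate)
```

Sampling 10 of 1440 combinations keeps the search affordable, at the cost of possibly missing the best setting.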
# Training a hyper-parameter tuned XGBoost classifier on our train data
# find more about XGBClassifier function here http://xgboost.readthedocs.io/en/latest/python/python_api.html?#xgboost.XGBClassifier
# -------------------------
# default parameters
# class xgboost.XGBClassifier(max_depth=3, learning_rate=0.1, n_estimators=100, silent=True,
# objective='binary:logistic', booster='gbtree', n_jobs=1, nthread=None, gamma=0, min_child_weight=1,
# max_delta_step=0, subsample=1, colsample_bytree=1, colsample_bylevel=1, reg_alpha=0, reg_lambda=1,
# scale_pos_weight=1, base_score=0.5, random_state=0, seed=None, missing=None, **kwargs)
# some of the methods of XGBClassifier()
# fit(X, y, sample_weight=None, eval_set=None, eval_metric=None, early_stopping_rounds=None, verbose=True, xgb_model=None)
# get_params([deep]) Get parameters for this estimator.
# predict(data, output_margin=False, ntree_limit=0) : Predict with data. NOTE: This function is not thread safe.
# get_score(importance_type='weight') -> get the feature importance
# -----------------------
# video link2: https://www.appliedaicourse.com/course/applied-ai-course-online/lessons/what-are-ensembles/
# -----------------------
x_cfl=XGBClassifier(n_estimators=1000,subsample=0.3,learning_rate=0.2,colsample_bytree=0.1,max_depth=3)
x_cfl.fit(X_train_pixel_asm,y_train_pixel_asm)
c_cfl=CalibratedClassifierCV(x_cfl,method='sigmoid')
c_cfl.fit(X_train_pixel_asm,y_train_pixel_asm)
predict_y = c_cfl.predict_proba(X_train_pixel_asm)
print ('train loss',log_loss(y_train_pixel_asm, predict_y))
predict_y = c_cfl.predict_proba(X_cv_pixel_asm)
print ('cv loss',log_loss(y_cv_pixel_asm, predict_y))
predict_y = c_cfl.predict_proba(X_test_pixel_asm)
print ('test loss',log_loss(y_test_pixel_asm, predict_y))
[10:29:43] WARNING: ..\src\learner.cc:1115: Starting in XGBoost 1.3.0, the default evaluation metric used with the objective 'multi:softprob' was changed from 'merror' to 'mlogloss'. Explicitly set eval_metric if you'd like to restore the old behavior.
[10:30:02] WARNING: ..\src\learner.cc:1115: Starting in XGBoost 1.3.0, the default evaluation metric used with the objective 'multi:softprob' was changed from 'merror' to 'mlogloss'. Explicitly set eval_metric if you'd like to restore the old behavior.
[10:30:16] WARNING: ..\src\learner.cc:1115: Starting in XGBoost 1.3.0, the default evaluation metric used with the objective 'multi:softprob' was changed from 'merror' to 'mlogloss'. Explicitly set eval_metric if you'd like to restore the old behavior.
[10:30:31] WARNING: ..\src\learner.cc:1115: Starting in XGBoost 1.3.0, the default evaluation metric used with the objective 'multi:softprob' was changed from 'merror' to 'mlogloss'. Explicitly set eval_metric if you'd like to restore the old behavior.
[10:30:46] WARNING: ..\src\learner.cc:1115: Starting in XGBoost 1.3.0, the default evaluation metric used with the objective 'multi:softprob' was changed from 'merror' to 'mlogloss'. Explicitly set eval_metric if you'd like to restore the old behavior.
[10:31:02] WARNING: ..\src\learner.cc:1115: Starting in XGBoost 1.3.0, the default evaluation metric used with the objective 'multi:softprob' was changed from 'merror' to 'mlogloss'. Explicitly set eval_metric if you'd like to restore the old behavior.
train loss 0.14931584738978887
cv loss 0.18012644472732228
test loss 0.19312309841657685
%%time
if False:
    # All 256 byte values as two-character lowercase hex strings: '00' ... 'ff'
    tokens = [format(i, '02x') for i in range(256)]
    # Every ordered pair of byte values is one bigram feature
    bigram_tokens = [i + j for i in tokens for j in tokens]
    # Bigrams in which one or both halves are the '??' placeholder found in the byte files
    special_before = ['??' + i for i in tokens]
    special_after = [i + '??' for i in tokens]
    bigram_tokens.insert(0, 'ID')
    bigram_tokens.extend(special_before)
    bigram_tokens.extend(special_after)
    bigram_tokens.append('????')
    print("Length of all the tokens in the byte file", len(bigram_tokens))

    # Column dtypes used when reading the CSVs back: string ID, small unsigned ints elsewhere
    bi_tokens_dtype = {i: (np.str_ if i == 'ID' else np.uint16) for i in bigram_tokens}
    bi_tokens_dtype['Class'] = np.uint8

    if not os.path.exists('bigram_byte.csv'):
        files = os.listdir('byteFiles')
        rootDict = []
        print("Creating Bi-gram of Byte Files")
        print("Estimated time: 2-3 hrs")
        for file in tqdm(files):
            with open('byteFiles/' + file, "r") as byte_file:
                tempDict = {key: 0 for key in bigram_tokens}
                tempDict['ID'] = file.split(".")[0]
                for line in byte_file:
                    # Drop the spaces so every byte occupies exactly two characters
                    line = "".join(line.lower().rstrip().split(" "))
                    # Step by 2 characters (one byte) and read 4 characters (two bytes)
                    for i in range(0, len(line) - 3, 2):
                        bigram = line[i:i + 4]
                        # Skip anything that is not a known bigram column
                        if bigram in tempDict:
                            tempDict[bigram] += 1
            rootDict.append(tempDict)
        print("Dictionary of all Byte files.")
        # Build the frame in one step (DataFrame.append was removed in pandas 2.0)
        bi_data = pd.DataFrame(rootDict, columns=bigram_tokens)
        bi_gram_byteFiles = pd.merge(bi_data, Y, how='inner', on='ID')
        bi_gram_byteFiles.to_csv('bigram_byte.csv', index=False)
    else:
        if not os.path.exists("bigram_size_byte.csv"):
            print("Bigram byte file already exists.")
            print("Reading File...")
            print("Estimated time: 10 mins")
            bi_gram_byteFiles = pd.read_csv('bigram_byte.csv', dtype=bi_tokens_dtype)
            print("Merging the bigram byte file with the file sizes")
            bi_byte_features_with_size = bi_gram_byteFiles.merge(data_size_byte, on='ID')
            # Both frames carry a Class column; keep a single copy
            bi_byte_features_with_size['Class'] = bi_byte_features_with_size['Class_x']
            bi_byte_features_with_size = bi_byte_features_with_size.drop(columns=['Class_x', 'Class_y'])
            bi_byte_features_with_size.to_csv("bigram_size_byte.csv", index=False)
        else:
            print("Bigram byte file with size already exists.")
            print("Reading File...")
            print("Estimated time: 10 mins")
            bi_byte_features_with_size = pd.read_csv("bigram_size_byte.csv", dtype=bi_tokens_dtype)
        result_bigram_byte = bi_byte_features_with_size.copy()
Wall time: 0 ns
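To make the counting loop in the cell above concrete, here is a small sketch on a single made-up line of hex bytes. It uses the same sliding window (a step of one byte and a width of two bytes). `count_byte_bigrams` is a hypothetical helper, the input line is invented rather than taken from the dataset, and unlike the cell above it does not filter against a fixed vocabulary.

```python
from collections import Counter


def count_byte_bigrams(line):
    """Count overlapping byte bigrams in one line of space-separated hex bytes."""
    # Remove the spaces so that every byte occupies exactly two characters
    joined = "".join(line.lower().split())
    # Move one byte (2 characters) at a time and read two bytes (4 characters)
    return Counter(joined[i:i + 4] for i in range(0, len(joined) - 3, 2))


counts = count_byte_bigrams("56 8D 44 24 8D 44")
print(counts["8d44"])  # 2: the pair '8d 44' occurs twice in the line

# Number of bigram columns: 256 * 256 ordinary pairs, 256 '??xx', 256 'xx??', and '????'
vocabulary_size = 256 * 256 + 256 + 256 + 1
print(vocabulary_size)  # 66049
```

The vocabulary size is where the figure of roughly 66k features comes from.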
We selected 1000 features by feature importance, using a random forest of 100 trees with max_depth = 100 (shallow relative to the number of dimensions).
We did not normalize the data first: normalizing roughly 66k features would take a long time, and decision-tree-based algorithms do not require normalized inputs.
%%time
if not os.path.exists('top_result_bigram_byte.csv'):
    X = result_bigram_byte.drop(['ID', 'Class'], axis=1)
    y = result_bigram_byte['Class']
    rf = RandomForestClassifier(n_estimators=100, max_depth=100, n_jobs=-1)
    rf.fit(X, y)
    # np.argsort sorts ascending, so this single [::-1] orders the column indices from most to least important
    imp_feature_indx = np.argsort(rf.feature_importances_)[::-1]
    # NOTE: the second [::-1] restores ascending order, so this slice keeps the 1000 lowest-ranked columns;
    # imp_feature_indx[:1000] would keep the 1000 highest-ranked ones instead
    imp_feature_name = X.columns[imp_feature_indx[::-1][:1000]]
    print('size' in imp_feature_name)  # False: size is not among the 1000 selected features
    # .copy() so that adding ID and Class below does not write into a view of X
    top_result_bigram_byte = X[imp_feature_name].copy()
    top_result_bigram_byte['ID'] = result_bigram_byte['ID']
    top_result_bigram_byte['Class'] = result_bigram_byte['Class']
    top_result_bigram_byte.to_csv('top_result_bigram_byte.csv', index=False)
else:
    top_result_bigram_byte = pd.read_csv('top_result_bigram_byte.csv')
False
Wall time: 23.5 s
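Because the direction of the sort decides which features survive, a tiny sketch of ranking by importance may help. It uses plain Python instead of NumPy, with invented importance scores and column names; `top_k_features` is a hypothetical helper. Sorting ascending and reversing once gives most-important-first, whereas slicing the ascending order directly returns the least important columns.

```python
def top_k_features(names, importances, k):
    """Return the k feature names with the highest importance scores."""
    # Ascending order of importance, the same order np.argsort would give
    ascending = sorted(range(len(importances)), key=lambda i: importances[i])
    # Reverse exactly once to put the most important feature first
    descending = ascending[::-1]
    return [names[i] for i in descending[:k]]


# Invented scores for illustration only
names = ["0a0b", "ff00", "??e8", "1234", "size"]
importances = [0.30, 0.05, 0.00, 0.45, 0.20]

top = top_k_features(names, importances, k=2)
print(top)  # ['1234', '0a0b']

# Slicing the ascending order instead yields the least important features
ascending = sorted(range(len(importances)), key=lambda i: importances[i])
bottom = [names[i] for i in ascending[:2]]
print(bottom)  # ['??e8', 'ff00']
```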
top_result_bigram_byte
| | 802d | 9ca7 | 9ca9 | 9cac | 9cad | 9cae | 9caf | 9cb0 | 9cb1 | 9cb2 | ... | a0a4 | a0a6 | a0c8 | a0c9 | a0cb | a0cc | a0ce | a0cf | ID | Class |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 4 | 1 | 6 | 12 | 3 | 4 | 4 | 2 | 5 | 4 | ... | 6 | 8 | 6 | 8 | 5 | 9 | 7 | 2 | 01azqd4InC7m9JpocGv5 | 9 |
| 1 | 0 | 1 | 1 | 0 | 14 | 0 | 0 | 0 | 0 | 1 | ... | 8 | 1 | 1 | 0 | 7 | 0 | 0 | 7 | 01IsoiSMh5gxyDYTl4CB | 2 |
| 2 | 4 | 12 | 7 | 4 | 8 | 7 | 6 | 8 | 8 | 4 | ... | 6 | 5 | 5 | 9 | 286 | 4 | 6 | 3 | 01jsnpXSAlgw6aPeDxrU | 9 |
| 3 | 2 | 0 | 0 | 4 | 3 | 1 | 1 | 4 | 0 | 1 | ... | 4 | 1 | 2 | 1 | 3 | 0 | 2 | 0 | 01kcPWA9K2BOxQeS5Rju | 1 |
| 4 | 2 | 1 | 1 | 2 | 0 | 1 | 0 | 1 | 1 | 1 | ... | 14 | 0 | 0 | 2 | 0 | 0 | 2 | 1 | 01SuzwMJEIXsK7A8dQbl | 8 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 10863 | 7 | 2 | 8 | 1 | 4 | 7 | 4 | 5 | 4 | 6 | ... | 5 | 1 | 6 | 6 | 6 | 2 | 3 | 4 | loIP1tiwELF9YNZQjSUO | 4 |
| 10864 | 2 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | ... | 0 | 1 | 2 | 0 | 0 | 0 | 0 | 1 | LOP6HaJKXpkic5dyuVnT | 4 |
| 10865 | 0 | 1 | 1 | 0 | 3 | 0 | 0 | 0 | 0 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | LOqA6FX02GWguYrI1Zbe | 4 |
| 10866 | 2 | 1 | 0 | 1 | 0 | 0 | 2 | 3 | 0 | 0 | ... | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | LoWgaidpb2IUM5ACcSGO | 4 |
| 10867 | 0 | 0 | 1 | 3 | 1 | 0 | 1 | 1 | 1 | 0 | ... | 0 | 1 | 2 | 1 | 2 | 1 | 1 | 1 | lS0IVqXeJrN6Dzi9Pap1 | 4 |
10868 rows × 1002 columns
# normalize each feature column with the notebook's normalize() helper
top_result_bigram_byte = normalize(top_result_bigram_byte)
top_result_bigram_byte.head()
| | 802d | 9ca7 | 9ca9 | 9cac | 9cad | 9cae | 9caf | 9cb0 | 9cb1 | 9cb2 | ... | a0a4 | a0a6 | a0c8 | a0c9 | a0cb | a0cc | a0ce | a0cf | ID | Class |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.000709 | 0.001042 | 0.001011 | 0.000759 | 0.011494 | 0.005135 | 0.021505 | 0.006734 | 0.023041 | 0.005882 | ... | 0.000706 | 0.040404 | 0.014963 | 0.040816 | 0.017483 | 0.024457 | 0.036842 | 0.008032 | 01azqd4InC7m9JpocGv5 | 9 |
| 1 | 0.000000 | 0.001042 | 0.000169 | 0.000000 | 0.053640 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.001471 | ... | 0.000941 | 0.005051 | 0.002494 | 0.000000 | 0.024476 | 0.000000 | 0.000000 | 0.028112 | 01IsoiSMh5gxyDYTl4CB | 2 |
| 2 | 0.000709 | 0.012500 | 0.001180 | 0.000253 | 0.030651 | 0.008986 | 0.032258 | 0.026936 | 0.036866 | 0.005882 | ... | 0.000706 | 0.025253 | 0.012469 | 0.045918 | 1.000000 | 0.010870 | 0.031579 | 0.012048 | 01jsnpXSAlgw6aPeDxrU | 9 |
| 3 | 0.000355 | 0.000000 | 0.000000 | 0.000253 | 0.011494 | 0.001284 | 0.005376 | 0.013468 | 0.000000 | 0.001471 | ... | 0.000471 | 0.005051 | 0.004988 | 0.005102 | 0.010490 | 0.000000 | 0.010526 | 0.000000 | 01kcPWA9K2BOxQeS5Rju | 1 |
| 4 | 0.000355 | 0.001042 | 0.000169 | 0.000127 | 0.000000 | 0.001284 | 0.000000 | 0.003367 | 0.004608 | 0.001471 | ... | 0.001647 | 0.000000 | 0.000000 | 0.010204 | 0.000000 | 0.000000 | 0.010526 | 0.004016 | 01SuzwMJEIXsK7A8dQbl | 8 |
5 rows × 1002 columns
a = top_result_bigram_byte.isnull().all()
a[a==True]
??e9    True
??e8    True
dtype: bool
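A column that comes out entirely null after normalization usually means it was constant to begin with. The sketch below illustrates this under the assumption that `normalize` (defined elsewhere in this notebook) performs min-max scaling, (x - min) / (max - min). `min_max_scale` is a hypothetical stand-in written in plain Python, and the values are invented. For a constant column the denominator is zero, and pandas reports the resulting 0/0 as NaN.

```python
import math


def min_max_scale(values):
    """Scale a list of numbers to the [0, 1] range.

    A constant column has max == min, so the division is 0/0. Plain Python
    would raise ZeroDivisionError there; this sketch returns NaN instead to
    mirror what pandas produces for the same arithmetic.
    """
    lo, hi = min(values), max(values)
    if hi == lo:
        return [math.nan] * len(values)
    return [(v - lo) / (hi - lo) for v in values]


scaled = min_max_scale([4, 0, 4, 2, 2])
print(scaled)  # [1.0, 0.0, 1.0, 0.5, 0.5]

# A bigram that never occurs in any file is a column of zeros
constant = min_max_scale([0, 0, 0])
print(all(math.isnan(v) for v in constant))  # True
```

Columns like these carry no information for any classifier, so dropping them before the split is harmless.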
bigram_byte_y = top_result_bigram_byte['Class']
bigram_byte_x = top_result_bigram_byte.drop(['ID','Class','??e9','??e8'], axis=1)
X_train_bigram_byte, X_test_bigram_byte, y_train_bigram_byte, y_test_bigram_byte = train_test_split(bigram_byte_x,bigram_byte_y ,stratify=bigram_byte_y,test_size=0.20)
X_train_bigram_byte, X_cv_bigram_byte, y_train_bigram_byte, y_cv_bigram_byte = train_test_split(X_train_bigram_byte, y_train_bigram_byte,stratify=y_train_bigram_byte,test_size=0.20)
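The two calls above first hold out 20% of the rows for testing and then 20% of the remainder for cross validation, which leaves roughly 64% / 16% / 20% for train / cv / test while keeping the class mix similar in each part. The sketch below reproduces that idea in plain Python on invented labels. `stratified_split` is a hypothetical helper and is not how scikit-learn implements `train_test_split`.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_size, seed=0):
    """Split positions 0..len(labels)-1 so each class keeps about the same share."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return train_idx, test_idx


# Invented, imbalanced labels: 50 / 30 / 20 samples of three classes
labels = [1] * 50 + [2] * 30 + [3] * 20

# First split: 20% test. Second split: 20% of what is left for cross validation.
rest_idx, test_idx = stratified_split(labels, test_size=0.20)
rest_labels = [labels[i] for i in rest_idx]
# These are positions within rest_labels, not within the original list
train_pos, cv_pos = stratified_split(rest_labels, test_size=0.20)

print(len(train_pos), len(cv_pos), len(test_idx))  # 64 16 20
```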
# find more about KNeighborsClassifier() here http://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html
# -------------------------
# default parameters
# KNeighborsClassifier(n_neighbors=5, weights='uniform', algorithm='auto', leaf_size=30, p=2,
# metric='minkowski', metric_params=None, n_jobs=1, **kwargs)
# methods of KNeighborsClassifier()
# fit(X, y) : Fit the model using X as training data and y as target values
# predict(X):Predict the class labels for the provided data
# predict_proba(X):Return probability estimates for the test data X.
#-------------------------------------
# video link: https://www.appliedaicourse.com/course/applied-ai-course-online/lessons/k-nearest-neighbors-geometric-intuition-with-a-toy-example-1/
#-------------------------------------
# find more about CalibratedClassifierCV here at http://scikit-learn.org/stable/modules/generated/sklearn.calibration.CalibratedClassifierCV.html
# ----------------------------
# default parameters
# sklearn.calibration.CalibratedClassifierCV(base_estimator=None, method='sigmoid', cv=3)
#
# some of the methods of CalibratedClassifierCV()
# fit(X, y[, sample_weight]) Fit the calibrated model
# get_params([deep]) Get parameters for this estimator.
# predict(X) Predict the target of new samples.
# predict_proba(X) Posterior probabilities of classification
#-------------------------------------
# video link:
#-------------------------------------
if not os.path.exists('models/bigram_byte_knn.sav'):
alpha = [x for x in range(1, 21,2)]
cv_log_error_array=[]
for i in alpha:
k_cfl=KNeighborsClassifier(n_neighbors=i)
k_cfl.fit(X_train_bigram_byte,y_train_bigram_byte)
sig_clf = CalibratedClassifierCV(k_cfl, method="sigmoid")
sig_clf.fit(X_train_bigram_byte, y_train_bigram_byte)
predict_y = sig_clf.predict_proba(X_cv_bigram_byte)
cv_log_error_array.append(log_loss(y_cv_bigram_byte, predict_y, labels=k_cfl.classes_, eps=1e-15))
for i in range(len(cv_log_error_array)):
print ('log_loss for k = ',alpha[i],'is',cv_log_error_array[i])
best_alpha = np.argmin(cv_log_error_array)
fig, ax = plt.subplots()
ax.plot(alpha, cv_log_error_array,c='g')
for i, txt in enumerate(np.round(cv_log_error_array,3)):
ax.annotate((alpha[i],np.round(txt,3)), (alpha[i],cv_log_error_array[i]))
plt.grid()
plt.title("Cross Validation Error for each alpha")
plt.xlabel("Alpha i's")
plt.ylabel("Error measure")
plt.show()
k_cfl=KNeighborsClassifier(n_neighbors=alpha[best_alpha])
k_cfl.fit(X_train_bigram_byte,y_train_bigram_byte)
sig_clf = CalibratedClassifierCV(k_cfl, method="sigmoid")
sig_clf.fit(X_train_bigram_byte, y_train_bigram_byte)
pred_y=sig_clf.predict(X_test_bigram_byte)
# save the model to disk
pickle.dump(sig_clf, open('models/bigram_byte_knn.sav', 'wb'))
else:
# load the model from disk
sig_clf = pickle.load(open('models/bigram_byte_knn.sav', 'rb'))
print(sig_clf)
predict_y = sig_clf.predict_proba(X_train_bigram_byte)
print ('log loss for train data',log_loss(y_train_bigram_byte, predict_y))
predict_y = sig_clf.predict_proba(X_cv_bigram_byte)
print ('log loss for cv data',log_loss(y_cv_bigram_byte, predict_y))
predict_y = sig_clf.predict_proba(X_test_bigram_byte)
print ('log loss for test data',log_loss(y_test_bigram_byte, predict_y))
plot_confusion_matrix(y_test_bigram_byte,sig_clf.predict(X_test_bigram_byte))
CalibratedClassifierCV(base_estimator=KNeighborsClassifier(n_neighbors=3))
log loss for train data 0.40875763586358393
log loss for cv data 0.7043270448604847
log loss for test data 0.6856947479122154
Number of misclassified points 19.641214351425944
-------------------------------------------------- Confusion matrix --------------------------------------------------
-------------------------------------------------- Precision matrix --------------------------------------------------
Sum of columns in precision matrix [1. 1. 1. 1. 1. 1. 1. 1. 1.]
-------------------------------------------------- Recall matrix --------------------------------------------------
Sum of rows in precision matrix [1. 1. 1. 1. 1. 1. 1. 1. 1.]
# read more about LogisticRegression() at http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html
# ------------------------------
# parameters set in this cell
# penalty='l2', C=<value being tuned, the inverse of regularization strength>, class_weight='balanced'
# some of the methods
# fit(X, y[, sample_weight]) Fit the model according to the given training data.
# predict(X) Predict class labels for samples in X.
# predict_proba(X) Probability estimates for each class.
#-------------------------------
# video link: https://www.appliedaicourse.com/course/applied-ai-course-online/lessons/geometric-intuition-1/
#------------------------------
if not os.path.exists('models/bigram_byte_lr.sav'):
alpha = [10 ** x for x in range(-5, 4)]
cv_log_error_array=[]
for i in alpha:
logisticR=LogisticRegression(penalty='l2',C=i,class_weight='balanced')
logisticR.fit(X_train_bigram_byte,y_train_bigram_byte)
sig_clf = CalibratedClassifierCV(logisticR, method="sigmoid")
sig_clf.fit(X_train_bigram_byte, y_train_bigram_byte)
predict_y = sig_clf.predict_proba(X_cv_bigram_byte)
cv_log_error_array.append(log_loss(y_cv_bigram_byte, predict_y, labels=logisticR.classes_, eps=1e-15))
for i in range(len(cv_log_error_array)):
print ('log_loss for c = ',alpha[i],'is',cv_log_error_array[i])
best_alpha = np.argmin(cv_log_error_array)
fig, ax = plt.subplots()
ax.plot(alpha, cv_log_error_array,c='g')
for i, txt in enumerate(np.round(cv_log_error_array,3)):
ax.annotate((alpha[i],np.round(txt,3)), (alpha[i],cv_log_error_array[i]))
plt.grid()
plt.title("Cross Validation Error for each alpha")
plt.xlabel("Alpha i's")
plt.ylabel("Error measure")
plt.show()
logisticR=LogisticRegression(penalty='l2',C=alpha[best_alpha],class_weight='balanced')
logisticR.fit(X_train_bigram_byte,y_train_bigram_byte)
sig_clf = CalibratedClassifierCV(logisticR, method="sigmoid")
sig_clf.fit(X_train_bigram_byte, y_train_bigram_byte)
# save the model to disk
pickle.dump(sig_clf, open('models/bigram_byte_lr.sav', 'wb'))
else:
# load the model from disk
sig_clf = pickle.load(open('models/bigram_byte_lr.sav', 'rb'))
print(sig_clf)
predict_y = sig_clf.predict_proba(X_train_bigram_byte)
print ('log loss for train data',(log_loss(y_train_bigram_byte, predict_y, labels=sig_clf.base_estimator.classes_, eps=1e-15)))
predict_y = sig_clf.predict_proba(X_cv_bigram_byte)
print ('log loss for cv data',(log_loss(y_cv_bigram_byte, predict_y, labels=sig_clf.base_estimator.classes_, eps=1e-15)))
predict_y = sig_clf.predict_proba(X_test_bigram_byte)
print ('log loss for test data',(log_loss(y_test_bigram_byte, predict_y, labels=sig_clf.base_estimator.classes_, eps=1e-15)))
plot_confusion_matrix(y_test_bigram_byte,sig_clf.predict(X_test_bigram_byte))
log_loss for c = 1e-05 is 1.2359783166767002
log_loss for c = 0.0001 is 1.235519368028496
log_loss for c = 0.001 is 1.230409182216696
log_loss for c = 0.01 is 1.1889702270357474
log_loss for c = 0.1 is 1.0285331659927939
log_loss for c = 1 is 0.9473340856040805
log_loss for c = 10 is 0.9359818011229696
log_loss for c = 100 is 0.9296016457220023
log_loss for c = 1000 is 0.9628911726752428
log loss for train data 0.8698622054138794
log loss for cv data 0.9296016457220023
log loss for test data 0.9402708372789486
Number of misclassified points 25.666973321067154
-------------------------------------------------- Confusion matrix --------------------------------------------------
-------------------------------------------------- Precision matrix --------------------------------------------------
Sum of columns in precision matrix [ 1. 1. 1. 1. nan 1. 1. 1. 1.]
-------------------------------------------------- Recall matrix --------------------------------------------------
Sum of rows in precision matrix [1. 1. 1. 1. 1. 1. 1. 1. 1.]
# --------------------------------
# default parameters
# sklearn.ensemble.RandomForestClassifier(n_estimators=10, criterion='gini', max_depth=None, min_samples_split=2,
# min_samples_leaf=1, min_weight_fraction_leaf=0.0, max_features='auto', max_leaf_nodes=None, min_impurity_decrease=0.0,
# min_impurity_split=None, bootstrap=True, oob_score=False, n_jobs=1, random_state=None, verbose=0, warm_start=False,
# class_weight=None)
# Some of the methods of RandomForestClassifier()
# fit(X, y, [sample_weight]) Build a forest of trees from the training set (X, y).
# predict(X) Predict the class for each sample in X.
# predict_proba(X) Predict class probabilities for each sample in X.
# some of attributes of RandomForestClassifier()
# feature_importances_ : array of shape = [n_features]
# The feature importances (the higher, the more important the feature).
# --------------------------------
# video link: https://www.appliedaicourse.com/course/applied-ai-course-online/lessons/random-forest-and-their-construction-2/
# --------------------------------
if not os.path.exists('models/bigram_byte_rf.sav'):
alpha=[10,50,100,500,1000,2000,3000]
cv_log_error_array=[]
for i in alpha:
r_cfl=RandomForestClassifier(n_estimators=i,random_state=42,n_jobs=-1)
r_cfl.fit(X_train_bigram_byte,y_train_bigram_byte)
sig_clf = CalibratedClassifierCV(r_cfl, method="sigmoid")
sig_clf.fit(X_train_bigram_byte, y_train_bigram_byte)
predict_y = sig_clf.predict_proba(X_cv_bigram_byte)
cv_log_error_array.append(log_loss(y_cv_bigram_byte, predict_y, labels=r_cfl.classes_, eps=1e-15))
for i in range(len(cv_log_error_array)):
print ('log_loss for c = ',alpha[i],'is',cv_log_error_array[i])
best_alpha = np.argmin(cv_log_error_array)
fig, ax = plt.subplots()
ax.plot(alpha, cv_log_error_array,c='g')
for i, txt in enumerate(np.round(cv_log_error_array,3)):
ax.annotate((alpha[i],np.round(txt,3)), (alpha[i],cv_log_error_array[i]))
plt.grid()
plt.title("Cross Validation Error for each alpha")
plt.xlabel("Alpha i's")
plt.ylabel("Error measure")
plt.show()
r_cfl=RandomForestClassifier(n_estimators=alpha[best_alpha],random_state=42,n_jobs=-1)
r_cfl.fit(X_train_bigram_byte,y_train_bigram_byte)
sig_clf = CalibratedClassifierCV(r_cfl, method="sigmoid")
sig_clf.fit(X_train_bigram_byte, y_train_bigram_byte)
predict_y = sig_clf.predict_proba(X_train_bigram_byte)
# save the model to disk
pickle.dump(sig_clf, open('models/bigram_byte_rf.sav', 'wb'))
else:
# load the model from disk
sig_clf = pickle.load(open('models/bigram_byte_rf.sav', 'rb'))
print(sig_clf)
predict_y = sig_clf.predict_proba(X_train_bigram_byte)
print ('log loss for train data',(log_loss(y_train_bigram_byte, predict_y, labels=sig_clf.base_estimator.classes_, eps=1e-15)))
predict_y = sig_clf.predict_proba(X_cv_bigram_byte)
print ('log loss for cv data',(log_loss(y_cv_bigram_byte, predict_y, labels=sig_clf.base_estimator.classes_, eps=1e-15)))
predict_y = sig_clf.predict_proba(X_test_bigram_byte)
print ('log loss for test data',(log_loss(y_test_bigram_byte, predict_y, labels=sig_clf.base_estimator.classes_, eps=1e-15)))
plot_confusion_matrix(y_test_bigram_byte,sig_clf.predict(X_test_bigram_byte))
log_loss for c = 10 is 0.4241031520266062
log_loss for c = 50 is 0.32716684840606175
log_loss for c = 100 is 0.31059318364965
log_loss for c = 500 is 0.2985834956372987
log_loss for c = 1000 is 0.29810426146714397
log_loss for c = 2000 is 0.29651743554205656
log_loss for c = 3000 is 0.2962916979701253
log loss for train data 0.05505888014059545
log loss for cv data 0.2962916979701253
log loss for test data 0.2929327636879
Number of misclassified points 81.18675252989881
-------------------------------------------------- Confusion matrix --------------------------------------------------
-------------------------------------------------- Precision matrix --------------------------------------------------
Sum of columns in precision matrix [ 1. 1. 1. 1. nan 1. 1. 1. 1.]
-------------------------------------------------- Recall matrix --------------------------------------------------
Sum of rows in precision matrix [1. 1. 1. 1. 1. 1. 1. 1. 1.]
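The `nan` in the fifth position of the column sums above is consistent with one class never being predicted on the test set: a precision matrix divides each confusion-matrix column by its total, and an empty column gives 0/0. The sketch below reproduces the effect on a made-up 3-class confusion matrix. The helper name `column_normalise` and the counts are illustrative assumptions, not the notebook's `plot_confusion_matrix` code.

```python
import math

def column_normalise(confusion):
    """Divide every entry by its column total (precision per predicted class).

    An empty column (a class that is never predicted) yields nan, mirroring
    what a 0/0 division produces in NumPy.
    """
    n = len(confusion)
    col_totals = [sum(confusion[r][c] for r in range(n)) for c in range(n)]
    return [
        [confusion[r][c] / col_totals[c] if col_totals[c] else math.nan
         for c in range(n)]
        for r in range(n)
    ]

# Rows are true classes, columns are predicted classes (illustrative counts).
confusion = [
    [8, 2, 0],
    [1, 9, 0],
    [3, 1, 0],  # the third class occurs in the data but is never predicted
]

precision = column_normalise(confusion)
column_sums = [sum(row[c] for row in precision) for c in range(3)]
print(column_sums)  # the last entry is nan because its column total is zero
```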
# Training a hyper-parameter tuned XGBoost classifier on our train data
# find more about XGBClassifier function here http://xgboost.readthedocs.io/en/latest/python/python_api.html?#xgboost.XGBClassifier
# -------------------------
# default parameters
# class xgboost.XGBClassifier(max_depth=3, learning_rate=0.1, n_estimators=100, silent=True,
# objective='binary:logistic', booster='gbtree', n_jobs=1, nthread=None, gamma=0, min_child_weight=1,
# max_delta_step=0, subsample=1, colsample_bytree=1, colsample_bylevel=1, reg_alpha=0, reg_lambda=1,
# scale_pos_weight=1, base_score=0.5, random_state=0, seed=None, missing=None, **kwargs)
# some of the methods of XGBClassifier()
# fit(X, y, sample_weight=None, eval_set=None, eval_metric=None, early_stopping_rounds=None, verbose=True, xgb_model=None)
# get_params([deep]) Get parameters for this estimator.
# predict(data, output_margin=False, ntree_limit=0) : Predict with data. NOTE: This function is not thread safe.
# get_booster().get_score(importance_type='weight') -> get the feature importance
# -----------------------
# video link2: https://www.appliedaicourse.com/course/applied-ai-course-online/lessons/what-are-ensembles/
# -----------------------
if not os.path.exists('models/bigram_byte_xgb.sav'):
    # candidate numbers of boosting rounds, each scored by calibrated CV log loss
    alpha = [10, 50, 100, 500, 1000, 2000, 3000]
    cv_log_error_array = []
    for i in alpha:
        x_cfl = XGBClassifier(n_estimators=i, nthread=-1)
        x_cfl.fit(X_train_bigram_byte, y_train_bigram_byte)
        sig_clf = CalibratedClassifierCV(x_cfl, method="sigmoid")
        sig_clf.fit(X_train_bigram_byte, y_train_bigram_byte)
        predict_y = sig_clf.predict_proba(X_cv_bigram_byte)
        cv_log_error_array.append(log_loss(y_cv_bigram_byte, predict_y, labels=x_cfl.classes_, eps=1e-15))
    for i in range(len(cv_log_error_array)):
        print('log_loss for c = ', alpha[i], 'is', cv_log_error_array[i])
    # index of the n_estimators value with the lowest CV log loss
    best_alpha = np.argmin(cv_log_error_array)
    fig, ax = plt.subplots()
    ax.plot(alpha, cv_log_error_array, c='g')
    for i, txt in enumerate(np.round(cv_log_error_array, 3)):
        ax.annotate((alpha[i], np.round(txt, 3)), (alpha[i], cv_log_error_array[i]))
    plt.grid()
    plt.title("Cross Validation Error for each alpha")
    plt.xlabel("Alpha i's")
    plt.ylabel("Error measure")
    plt.show()
    # refit with the best n_estimators and calibrate the probabilities
    x_cfl = XGBClassifier(n_estimators=alpha[best_alpha], nthread=-1)
    x_cfl.fit(X_train_bigram_byte, y_train_bigram_byte)
    sig_clf = CalibratedClassifierCV(x_cfl, method="sigmoid")
    sig_clf.fit(X_train_bigram_byte, y_train_bigram_byte)
    # save the model to disk
    pickle.dump(sig_clf, open('models/bigram_byte_xgb.sav', 'wb'))
else:
    # load the model from disk
    sig_clf = pickle.load(open('models/bigram_byte_xgb.sav', 'rb'))
    print(sig_clf)
predict_y = sig_clf.predict_proba(X_train_bigram_byte)
print('For values of best alpha = ', sig_clf.base_estimator.n_estimators, "The train log loss is:", log_loss(y_train_bigram_byte, predict_y))
predict_y = sig_clf.predict_proba(X_cv_bigram_byte)
print('For values of best alpha = ', sig_clf.base_estimator.n_estimators, "The cross validation log loss is:", log_loss(y_cv_bigram_byte, predict_y))
predict_y = sig_clf.predict_proba(X_test_bigram_byte)
print('For values of best alpha = ', sig_clf.base_estimator.n_estimators, "The test log loss is:", log_loss(y_test_bigram_byte, predict_y))
plot_confusion_matrix(y_test_bigram_byte, sig_clf.predict(X_test_bigram_byte))
[13:48:40] WARNING: ..\src\learner.cc:1115: Starting in XGBoost 1.3.0, the default evaluation metric used with the objective 'multi:softprob' was changed from 'merror' to 'mlogloss'. Explicitly set eval_metric if you'd like to restore the old behavior.
log_loss for c = 10 is 0.29173793118598346
log_loss for c = 50 is 0.21819178087331897
log_loss for c = 100 is 0.21550939469879363
log_loss for c = 500 is 0.2143695711489488
log_loss for c = 1000 is 0.21423025730152923
log_loss for c = 2000 is 0.21510047758515016
log_loss for c = 3000 is 0.2157770921672099
For values of best alpha = 1000 The train log loss is: 0.06661155445132415
For values of best alpha = 1000 The cross validation log loss is: 0.21423025730152923
For values of best alpha = 1000 The test log loss is: 0.22263655173936306
Number of misclassified points 5.197792088316467
-------------------------------------------------- Confusion matrix --------------------------------------------------
-------------------------------------------------- Precision matrix --------------------------------------------------
Sum of columns in precision matrix [1. 1. 1. 1. 1. 1. 1. 1. 1.]
-------------------------------------------------- Recall matrix --------------------------------------------------
Sum of rows in precision matrix [1. 1. 1. 1. 1. 1. 1. 1. 1.]
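Every model in this notebook is compared by multi-class log loss. As a reminder of what that number measures, here is a minimal pure-Python sketch of the metric. It assumes 0-based class indices and one probability row per sample, and it only clips probabilities (it does not renormalise rows), so it is a simplification of `sklearn.metrics.log_loss` rather than a replacement. The function name `multiclass_log_loss` and the toy probabilities are illustrative.

```python
import math

def multiclass_log_loss(y_true, probs, eps=1e-15):
    """Mean negative log-probability assigned to the true class.

    y_true: class indices (0-based); probs: one probability row per sample.
    Probabilities are clipped to [eps, 1 - eps] so log(0) never occurs.
    """
    total = 0.0
    for label, row in zip(y_true, probs):
        p = min(max(row[label], eps), 1 - eps)
        total += -math.log(p)
    return total / len(y_true)

# Three toy samples with three classes each (illustrative values).
y_true = [0, 2, 1]
probs = [
    [0.9, 0.05, 0.05],
    [0.1, 0.1, 0.8],
    [0.2, 0.7, 0.1],
]
loss = multiclass_log_loss(y_true, probs)

# Baseline: a uniform guess over nine classes.
uniform = multiclass_log_loss([0], [[1 / 9] * 9])
print(round(loss, 4), round(uniform, 4))
```

A uniform guess over nine classes scores ln 9, roughly 2.197, so test losses around 0.2 are far better than chance.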
if False:
    x_cfl=XGBClassifier()
    params={
        'learning_rate':[0.01,0.03,0.05,0.1,0.15,0.2],
        'n_estimators':[100,200,500,1000,2000],
        'max_depth':[3,5,10],
        'colsample_bytree':[0.1,0.3,0.5,1],
        'subsample':[0.1,0.3,0.5,1]
    }
    random_cfl=RandomizedSearchCV(x_cfl,param_distributions=params,verbose=10,n_jobs=-1)
    random_cfl.fit(X_train_bigram_byte,y_train_bigram_byte)
Fitting 5 folds for each of 10 candidates, totalling 50 fits
[15:15:39] WARNING: ..\src\learner.cc:1115: Starting in XGBoost 1.3.0, the default evaluation metric used with the objective 'multi:softprob' was changed from 'merror' to 'mlogloss'. Explicitly set eval_metric if you'd like to restore the old behavior.
if False:
    print(random_cfl.best_params_)
{'subsample': 0.5, 'n_estimators': 1000, 'max_depth': 5, 'learning_rate': 0.1, 'colsample_bytree': 0.5}
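The log line "5 folds for each of 10 candidates" shows how little of the grid a random search visits. The sketch below enumerates the full grid from the cell above and draws 10 candidates from it without replacement, which is how we understand the search to behave when every parameter is given as a list. The seed and the variable names are illustrative assumptions; this is not the scikit-learn implementation.

```python
import itertools
import random

# Same search space as the RandomizedSearchCV cell above.
param_grid = {
    'learning_rate': [0.01, 0.03, 0.05, 0.1, 0.15, 0.2],
    'n_estimators': [100, 200, 500, 1000, 2000],
    'max_depth': [3, 5, 10],
    'colsample_bytree': [0.1, 0.3, 0.5, 1],
    'subsample': [0.1, 0.3, 0.5, 1],
}

# Every combination an exhaustive grid search would have to fit.
names = list(param_grid)
all_candidates = [dict(zip(names, values))
                  for values in itertools.product(*param_grid.values())]

rng = random.Random(0)      # fixed seed, only so the sketch is repeatable
n_iter, n_folds = 10, 5     # taken from the log line above
sampled = rng.sample(all_candidates, n_iter)

print(len(all_candidates))  # size of the full grid
print(n_iter * n_folds)     # model fits performed by the random search
```

With 1440 combinations in the grid, the 10 sampled candidates cover under 1% of it, so the reported best parameters are the best of a small sample rather than a global optimum.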
# Train an XGBoost classifier on the train data with the best hyper-parameters reported by the random search above
# XGBClassifier documentation: http://xgboost.readthedocs.io/en/latest/python/python_api.html?#xgboost.XGBClassifier
x_cfl=XGBClassifier(n_estimators=1000,subsample=0.5,learning_rate=0.1,colsample_bytree=0.5,max_depth=5)
x_cfl.fit(X_train_bigram_byte,y_train_bigram_byte)
c_cfl=CalibratedClassifierCV(x_cfl,method='sigmoid')
c_cfl.fit(X_train_bigram_byte,y_train_bigram_byte)
predict_y = c_cfl.predict_proba(X_train_bigram_byte)
print ('train loss',log_loss(y_train_bigram_byte, predict_y))
predict_y = c_cfl.predict_proba(X_cv_bigram_byte)
print ('cv loss',log_loss(y_cv_bigram_byte, predict_y))
predict_y = c_cfl.predict_proba(X_test_bigram_byte)
print ('test loss',log_loss(y_test_bigram_byte, predict_y))
[19:42:01] WARNING: ..\src\learner.cc:1115: Starting in XGBoost 1.3.0, the default evaluation metric used with the objective 'multi:softprob' was changed from 'merror' to 'mlogloss'. Explicitly set eval_metric if you'd like to restore the old behavior.
train loss 0.05802943379524719
cv loss 0.20462100069658026
test loss 0.20785626909687324
uni_byte_size = pd.read_csv('result_with_size.csv').drop(columns=['Unnamed: 0'])
print(uni_byte_size.shape)
uni_asm_size = pd.read_csv('asmoutputfile.csv')
print(uni_asm_size.shape)
pixel_asm = pd.read_csv('pixel_asm.csv')
print(pixel_asm.shape)
bigram_byte_size = pd.read_csv('top_result_bigram_byte.csv')
print(bigram_byte_size.shape)
(10868, 260)
(10868, 52)
(10868, 801)
(10868, 1002)
data = uni_byte_size.merge(uni_asm_size, on=['ID'], how = 'inner').merge(pixel_asm, on=['ID'], how = 'inner').merge(bigram_byte_size, on=['ID'], how = 'inner')
data['Class'] = data['Class_x']
data = data.drop(['ID','Class_x','Class_y','??e9', '??e8', '.BSS:','rtn','.CODE','18_y','19_y'], axis=1)
X = data.drop(['Class'], axis=1)
y = data['Class']
X = normalize(X)
print(X.shape)
print(y.shape)
(10868, 2102)
(10868,)
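The printed shapes can be cross-checked by hand. The sketch below assumes each of the four tables carries exactly one `ID` column that the inner joins collapse into a single shared key, and that all ten names passed to `drop` exist after the merge. Under those assumptions the arithmetic reproduces the 2102 feature columns.

```python
# Column counts printed when the four feature tables were loaded.
table_widths = [260, 52, 801, 1002]

# An inner join on 'ID' keeps one shared ID column, so every table after the
# first contributes (width - 1) new columns.
n_columns = table_widths[0] + sum(w - 1 for w in table_widths[1:])

n_columns += 1  # data['Class'] = data['Class_x'] adds one column

# Columns removed by data.drop(...) in the merge cell.
dropped = ['ID', 'Class_x', 'Class_y', '??e9', '??e8',
           '.BSS:', 'rtn', '.CODE', '18_y', '19_y']
n_columns -= len(dropped)

n_features = n_columns - 1  # X = data.drop(['Class'], axis=1)
print(n_features)
```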
xtsne=TSNE(perplexity=50)
results=xtsne.fit_transform(X)
vis_x = results[:, 0]
vis_y = results[:, 1]
plt.scatter(vis_x, vis_y, c=y, cmap=plt.cm.get_cmap("jet", 9))
plt.colorbar(ticks=range(9))
plt.clim(0.5, 9)
plt.show()
xtsne=TSNE(perplexity=30)
results=xtsne.fit_transform(X)
vis_x = results[:, 0]
vis_y = results[:, 1]
plt.scatter(vis_x, vis_y, c=y, cmap=plt.cm.get_cmap("jet", 9))
plt.colorbar(ticks=range(9))
plt.clim(0.5, 9)
plt.show()
X_train, X_test_merge, y_train, y_test_merge = train_test_split(X, y,stratify=y,test_size=0.20)
X_train_merge, X_cv_merge, y_train_merge, y_cv_merge = train_test_split(X_train, y_train,stratify=y_train,test_size=0.20)
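Two successive 80/20 stratified splits leave roughly 64% of the samples for training, 16% for cross-validation and 20% for testing. The sketch below works out the set sizes for the 10868 samples, assuming the held-out fraction is rounded up to a whole sample, which is how we understand `train_test_split` to treat a float `test_size`. The helper `split_sizes` is illustrative.

```python
import math

def split_sizes(n_samples, test_fraction):
    """Return (n_train, n_test), rounding the held-out part up."""
    n_test = math.ceil(test_fraction * n_samples)
    return n_samples - n_test, n_test

n_total = 10868
n_rest, n_test = split_sizes(n_total, 0.20)  # first split: hold out the test set
n_train, n_cv = split_sizes(n_rest, 0.20)    # second split: hold out the CV set

print(n_train, n_cv, n_test)
```

A test set of 2174 samples is consistent with the misclassification percentage printed for the bigram XGBoost model earlier: 5.1978% corresponds to 113 of 2174 test points.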
# --------------------------------
# default parameters
# sklearn.ensemble.RandomForestClassifier(n_estimators=10, criterion='gini', max_depth=None, min_samples_split=2,
# min_samples_leaf=1, min_weight_fraction_leaf=0.0, max_features='auto', max_leaf_nodes=None, min_impurity_decrease=0.0,
# min_impurity_split=None, bootstrap=True, oob_score=False, n_jobs=1, random_state=None, verbose=0, warm_start=False,
# class_weight=None)
# Some of the methods of RandomForestClassifier()
# fit(X, y, [sample_weight]) Build a forest of trees from the training data.
# predict(X) Predict the class for each sample in X.
# predict_proba(X) Predict class probabilities for each sample in X.
# some of the attributes of RandomForestClassifier()
# feature_importances_ : array of shape = [n_features]
# The feature importances (the higher, the more important the feature).
# --------------------------------
# video link: https://www.appliedaicourse.com/course/applied-ai-course-online/lessons/random-forest-and-their-construction-2/
# --------------------------------
from sklearn.ensemble import RandomForestClassifier

# candidate numbers of trees, each scored by calibrated log loss on the CV split
alpha = [10, 50, 100, 500, 1000, 2000, 3000]
cv_log_error_array = []
for i in alpha:
    r_cfl = RandomForestClassifier(n_estimators=i, random_state=42, n_jobs=-1)
    r_cfl.fit(X_train_merge, y_train_merge)
    sig_clf = CalibratedClassifierCV(r_cfl, method="sigmoid")
    sig_clf.fit(X_train_merge, y_train_merge)
    predict_y = sig_clf.predict_proba(X_cv_merge)
    cv_log_error_array.append(log_loss(y_cv_merge, predict_y, labels=r_cfl.classes_, eps=1e-15))
for i in range(len(cv_log_error_array)):
    print('log_loss for c = ', alpha[i], 'is', cv_log_error_array[i])
# index of the n_estimators value with the lowest CV log loss
best_alpha = np.argmin(cv_log_error_array)
fig, ax = plt.subplots()
ax.plot(alpha, cv_log_error_array, c='g')
for i, txt in enumerate(np.round(cv_log_error_array, 3)):
    ax.annotate((alpha[i], np.round(txt, 3)), (alpha[i], cv_log_error_array[i]))
plt.grid()
plt.title("Cross Validation Error for each alpha")
plt.xlabel("Alpha i's")
plt.ylabel("Error measure")
plt.show()
# refit with the best n_estimators and calibrate the probabilities
r_cfl = RandomForestClassifier(n_estimators=alpha[best_alpha], random_state=42, n_jobs=-1)
r_cfl.fit(X_train_merge, y_train_merge)
sig_clf = CalibratedClassifierCV(r_cfl, method="sigmoid")
sig_clf.fit(X_train_merge, y_train_merge)
predict_y = sig_clf.predict_proba(X_train_merge)
print('For values of best alpha = ', alpha[best_alpha], "The train log loss is:", log_loss(y_train_merge, predict_y))
predict_y = sig_clf.predict_proba(X_cv_merge)
print('For values of best alpha = ', alpha[best_alpha], "The cross validation log loss is:", log_loss(y_cv_merge, predict_y))
predict_y = sig_clf.predict_proba(X_test_merge)
print('For values of best alpha = ', alpha[best_alpha], "The test log loss is:", log_loss(y_test_merge, predict_y))
log_loss for c = 10 is 0.0387275672817498
log_loss for c = 50 is 0.04048531366346409
log_loss for c = 100 is 0.041424579968752914
log_loss for c = 500 is 0.041242463893518794
log_loss for c = 1000 is 0.041394019693367985
log_loss for c = 2000 is 0.041397579501051364
log_loss for c = 3000 is 0.04139688255395738
For values of best alpha = 10 The train log loss is: 0.013746955309086534
For values of best alpha = 10 The cross validation log loss is: 0.0387275672817498
For values of best alpha = 10 The test log loss is: 0.04310718541371111
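The selection step in the cell above is an argmin over the cross-validation losses. The sketch below restates it with the printed values (rounded to five decimals) and also reports the spread between the best and worst candidate. The variable names are illustrative.

```python
# n_estimators values tried and the CV log losses printed above (rounded).
alpha = [10, 50, 100, 500, 1000, 2000, 3000]
cv_log_loss = [0.03873, 0.04049, 0.04142, 0.04124, 0.04139, 0.04140, 0.04140]

# Pure-Python equivalent of np.argmin: index of the smallest loss.
best_index = min(range(len(cv_log_loss)), key=cv_log_loss.__getitem__)
best_n_estimators = alpha[best_index]

# Spread between the best and worst candidate, to judge how much the choice matters.
spread = max(cv_log_loss) - min(cv_log_loss)
print(best_n_estimators, round(spread, 5))
```

The spread is under 0.003, so the choice of 10 trees rests on a small difference measured on a single validation split.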
# Train an XGBoost classifier on the merged features
# XGBClassifier documentation: http://xgboost.readthedocs.io/en/latest/python/python_api.html?#xgboost.XGBClassifier
if False:  # best n_estimators is 2000, no need to rerun the search
    alpha = [10, 50, 100, 500, 1000, 2000, 3000]
    cv_log_error_array = []
    for i in alpha:
        x_cfl = XGBClassifier(n_estimators=i)
        x_cfl.fit(X_train_merge, y_train_merge)
        sig_clf = CalibratedClassifierCV(x_cfl, method="sigmoid")
        sig_clf.fit(X_train_merge, y_train_merge)
        predict_y = sig_clf.predict_proba(X_cv_merge)
        cv_log_error_array.append(log_loss(y_cv_merge, predict_y, labels=x_cfl.classes_, eps=1e-15))
    for i in range(len(cv_log_error_array)):
        print('log_loss for c = ', alpha[i], 'is', cv_log_error_array[i])
    best_alpha = np.argmin(cv_log_error_array)
    fig, ax = plt.subplots()
    ax.plot(alpha, cv_log_error_array, c='g')
    for i, txt in enumerate(np.round(cv_log_error_array, 3)):
        ax.annotate((alpha[i], np.round(txt, 3)), (alpha[i], cv_log_error_array[i]))
    plt.grid()
    plt.title("Cross Validation Error for each alpha")
    plt.xlabel("Alpha i's")
    plt.ylabel("Error measure")
    plt.show()
x_cfl = XGBClassifier(n_estimators=2000, nthread=-1)
x_cfl.fit(X_train_merge, y_train_merge, verbose=True)
sig_clf = CalibratedClassifierCV(x_cfl, method="sigmoid")
sig_clf.fit(X_train_merge, y_train_merge)
# Report the n_estimators actually used: with the search skipped, alpha[best_alpha]
# would still hold the value left over from the random forest cell.
predict_y = sig_clf.predict_proba(X_train_merge)
print('For values of best alpha = ', x_cfl.n_estimators, "The train log loss is:", log_loss(y_train_merge, predict_y))
predict_y = sig_clf.predict_proba(X_cv_merge)
print('For values of best alpha = ', x_cfl.n_estimators, "The cross validation log loss is:", log_loss(y_cv_merge, predict_y))
predict_y = sig_clf.predict_proba(X_test_merge)
print('For values of best alpha = ', x_cfl.n_estimators, "The test log loss is:", log_loss(y_test_merge, predict_y))
[20:58:21] WARNING: ..\src\learner.cc:1115: Starting in XGBoost 1.3.0, the default evaluation metric used with the objective 'multi:softprob' was changed from 'merror' to 'mlogloss'. Explicitly set eval_metric if you'd like to restore the old behavior.
[21:16:39] WARNING: ..\src\learner.cc:1115: Starting in XGBoost 1.3.0, the default evaluation metric used with the objective 'multi:softprob' was changed from 'merror' to 'mlogloss'. Explicitly set eval_metric if you'd like to restore the old behavior. [21:20:35] WARNING: ..\src\learner.cc:1115: Starting in XGBoost 1.3.0, the default evaluation metric used with the objective 'multi:softprob' was changed from 'merror' to 'mlogloss'. Explicitly set eval_metric if you'd like to restore the old behavior. [21:23:42] WARNING: ..\src\learner.cc:1115: Starting in XGBoost 1.3.0, the default evaluation metric used with the objective 'multi:softprob' was changed from 'merror' to 'mlogloss'. Explicitly set eval_metric if you'd like to restore the old behavior. [21:26:47] WARNING: ..\src\learner.cc:1115: Starting in XGBoost 1.3.0, the default evaluation metric used with the objective 'multi:softprob' was changed from 'merror' to 'mlogloss'. Explicitly set eval_metric if you'd like to restore the old behavior. [21:29:52] WARNING: ..\src\learner.cc:1115: Starting in XGBoost 1.3.0, the default evaluation metric used with the objective 'multi:softprob' was changed from 'merror' to 'mlogloss'. Explicitly set eval_metric if you'd like to restore the old behavior. [21:32:49] WARNING: ..\src\learner.cc:1115: Starting in XGBoost 1.3.0, the default evaluation metric used with the objective 'multi:softprob' was changed from 'merror' to 'mlogloss'. Explicitly set eval_metric if you'd like to restore the old behavior. [21:35:47] WARNING: ..\src\learner.cc:1115: Starting in XGBoost 1.3.0, the default evaluation metric used with the objective 'multi:softprob' was changed from 'merror' to 'mlogloss'. Explicitly set eval_metric if you'd like to restore the old behavior. [21:41:27] WARNING: ..\src\learner.cc:1115: Starting in XGBoost 1.3.0, the default evaluation metric used with the objective 'multi:softprob' was changed from 'merror' to 'mlogloss'. 
Explicitly set eval_metric if you'd like to restore the old behavior. [21:45:48] WARNING: ..\src\learner.cc:1115: Starting in XGBoost 1.3.0, the default evaluation metric used with the objective 'multi:softprob' was changed from 'merror' to 'mlogloss'. Explicitly set eval_metric if you'd like to restore the old behavior. [21:50:14] WARNING: ..\src\learner.cc:1115: Starting in XGBoost 1.3.0, the default evaluation metric used with the objective 'multi:softprob' was changed from 'merror' to 'mlogloss'. Explicitly set eval_metric if you'd like to restore the old behavior. [21:54:43] WARNING: ..\src\learner.cc:1115: Starting in XGBoost 1.3.0, the default evaluation metric used with the objective 'multi:softprob' was changed from 'merror' to 'mlogloss'. Explicitly set eval_metric if you'd like to restore the old behavior. [21:59:08] WARNING: ..\src\learner.cc:1115: Starting in XGBoost 1.3.0, the default evaluation metric used with the objective 'multi:softprob' was changed from 'merror' to 'mlogloss'. Explicitly set eval_metric if you'd like to restore the old behavior. log_loss for c = 10 is 0.043845870114856605 log_loss for c = 50 is 0.035923358326618346 log_loss for c = 100 is 0.03562863183330795 log_loss for c = 500 is 0.03537819478049582 log_loss for c = 1000 is 0.035378129314047796 log_loss for c = 2000 is 0.035377801687070186 log_loss for c = 3000 is 0.035378450368450026
[22:04:02] WARNING: ..\src\learner.cc:1115: Starting in XGBoost 1.3.0, the default evaluation metric used with the objective 'multi:softprob' was changed from 'merror' to 'mlogloss'. Explicitly set eval_metric if you'd like to restore the old behavior. [22:09:37] WARNING: ..\src\learner.cc:1115: Starting in XGBoost 1.3.0, the default evaluation metric used with the objective 'multi:softprob' was changed from 'merror' to 'mlogloss'. Explicitly set eval_metric if you'd like to restore the old behavior. [22:14:08] WARNING: ..\src\learner.cc:1115: Starting in XGBoost 1.3.0, the default evaluation metric used with the objective 'multi:softprob' was changed from 'merror' to 'mlogloss'. Explicitly set eval_metric if you'd like to restore the old behavior. [22:18:27] WARNING: ..\src\learner.cc:1115: Starting in XGBoost 1.3.0, the default evaluation metric used with the objective 'multi:softprob' was changed from 'merror' to 'mlogloss'. Explicitly set eval_metric if you'd like to restore the old behavior. [22:22:59] WARNING: ..\src\learner.cc:1115: Starting in XGBoost 1.3.0, the default evaluation metric used with the objective 'multi:softprob' was changed from 'merror' to 'mlogloss'. Explicitly set eval_metric if you'd like to restore the old behavior. [22:27:31] WARNING: ..\src\learner.cc:1115: Starting in XGBoost 1.3.0, the default evaluation metric used with the objective 'multi:softprob' was changed from 'merror' to 'mlogloss'. Explicitly set eval_metric if you'd like to restore the old behavior. For values of best alpha = 2000 The train log loss is: 0.010848637377927182 For values of best alpha = 2000 The cross validation log loss is: 0.035378450368450026 For values of best alpha = 2000 The test log loss is: 0.022632302818540218
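The best setting can be read off the printed values by taking the one with the lowest cross-validation log loss. The sketch below copies the values from the output above and assumes that `c` is the number of trees, as the `n_estimators=2000` entry in the results table suggests. The names `n_estimators_grid` and `cv_log_loss` are illustrative and do not come from the notebook.

```python
# Candidate values (printed as "c" in the log above) and their CV log losses.
n_estimators_grid = [10, 50, 100, 500, 1000, 2000, 3000]
cv_log_loss = [
    0.043845870114856605,
    0.035923358326618346,
    0.03562863183330795,
    0.03537819478049582,
    0.035378129314047796,
    0.035377801687070186,
    0.035378450368450026,
]

# Index of the smallest CV log loss.
best_idx = min(range(len(cv_log_loss)), key=cv_log_loss.__getitem__)
best_c = n_estimators_grid[best_idx]
print("best c =", best_c, "with CV log loss", cv_log_loss[best_idx])

# How much the CV log loss still improves between c = 500 and the best c.
gain_after_500 = cv_log_loss[3] - cv_log_loss[best_idx]
print("gain after c = 500:", gain_after_500)
```

From about 500 onward the CV log loss changes only in the seventh decimal place, so a much smaller model would likely score almost the same at a fraction of the training time.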
The randomized hyperparameter search below could not be run because of computational limitations, so the cells are disabled with `if False:`.
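from sklearn.model_selection import RandomizedSearchCV
from xgboost import XGBClassifier

if False:
    # Randomized search over XGBoost hyperparameters on the merged features.
    x_cfl = XGBClassifier()
    prams = {
        'learning_rate': [0.01, 0.03, 0.05, 0.1, 0.15, 0.2],
        'n_estimators': [100, 200, 500, 1000, 2000],
        'max_depth': [3, 5, 10],
        'colsample_bytree': [0.1, 0.3, 0.5, 1],
        'subsample': [0.1, 0.3, 0.5, 1],
    }
    random_cfl = RandomizedSearchCV(x_cfl, param_distributions=prams, verbose=10, n_jobs=-1)
    random_cfl.fit(X_train_merge, y_train_merge)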
if False:
    print(random_cfl.best_params_)
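To get a feel for why the search is expensive, it helps to count how many parameter combinations the grid contains and how many model fits a random search would need. The sketch below uses only the standard library and imitates the sampling step. It assumes scikit-learn's defaults of `n_iter=10` sampled candidates and 5-fold cross-validation, neither of which is set explicitly in the notebook. The names `param_grid`, `all_combos` and `sampled` are illustrative.

```python
import itertools
import random

# Same grid as the disabled search above.
param_grid = {
    'learning_rate': [0.01, 0.03, 0.05, 0.1, 0.15, 0.2],
    'n_estimators': [100, 200, 500, 1000, 2000],
    'max_depth': [3, 5, 10],
    'colsample_bytree': [0.1, 0.3, 0.5, 1],
    'subsample': [0.1, 0.3, 0.5, 1],
}

# Enumerate every combination in the grid as a dict of parameter values.
names = list(param_grid)
all_combos = [dict(zip(names, values))
              for values in itertools.product(*param_grid.values())]

n_iter = 10   # assumption: default number of sampled candidates
n_folds = 5   # assumption: default number of CV folds

# Draw candidates without replacement, as a random search over lists would.
rng = random.Random(0)  # fixed seed so the sample is reproducible
sampled = rng.sample(all_combos, n_iter)

print("grid size:", len(all_combos))
print("fits for the random search:", n_iter * n_folds)
print("fits for an exhaustive grid search:", len(all_combos) * n_folds)
for combo in sampled[:3]:
    print(combo)
```

Under these assumptions the random search trains 50 models instead of the 7200 an exhaustive search would need, and several of those candidates use 1000 or 2000 trees, which is consistent with the search being too slow to run here.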
# find more about XGBClassifier function here http://xgboost.readthedocs.io/en/latest/python/python_api.html?#xgboost.XGBClassifier
# -------------------------
# default parameters
# class xgboost.XGBClassifier(max_depth=3, learning_rate=0.1, n_estimators=100, silent=True,
# objective='binary:logistic', booster='gbtree', n_jobs=1, nthread=None, gamma=0, min_child_weight=1,
# max_delta_step=0, subsample=1, colsample_bytree=1, colsample_bylevel=1, reg_alpha=0, reg_lambda=1,
# scale_pos_weight=1, base_score=0.5, random_state=0, seed=None, missing=None, **kwargs)
# some of the methods of XGBClassifier()
# fit(X, y, sample_weight=None, eval_set=None, eval_metric=None, early_stopping_rounds=None, verbose=True, xgb_model=None)
# get_params([deep]) Get parameters for this estimator.
# predict(data, output_margin=False, ntree_limit=0) : Predict with data. NOTE: This function is not thread safe.
# get_score(importance_type='weight') -> get the feature importance
# -----------------------
# video link2: https://www.appliedaicourse.com/course/applied-ai-course-online/lessons/what-are-ensembles/
# -----------------------
if False:
    # Train XGBoost with fixed hyperparameters on the merged features.
    x_cfl = XGBClassifier(n_estimators=2000, max_depth=10, learning_rate=0.05, colsample_bytree=0.3, subsample=0.3, nthread=-1)
    x_cfl.fit(X_train_merge, y_train_merge, verbose=True)
    # Calibrate the predicted probabilities with Platt scaling (sigmoid).
    sig_clf = CalibratedClassifierCV(x_cfl, method="sigmoid")
    sig_clf.fit(X_train_merge, y_train_merge)
    # Report log loss on the train, cross-validation and test splits.
    predict_y = sig_clf.predict_proba(X_train_merge)
    print('For values of best alpha = ', alpha[best_alpha], "The train log loss is:", log_loss(y_train_merge, predict_y))
    predict_y = sig_clf.predict_proba(X_cv_merge)
    print('For values of best alpha = ', alpha[best_alpha], "The cross validation log loss is:", log_loss(y_cv_merge, predict_y))
    predict_y = sig_clf.predict_proba(X_test_merge)
    print('For values of best alpha = ', alpha[best_alpha], "The test log loss is:", log_loss(y_test_merge, predict_y))
    # Use the merged test labels so they match the merged test features.
    plot_confusion_matrix(y_test_merge, sig_clf.predict(X_test_merge))
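The log loss reported by these cells is the average negative log-probability that the model assigns to the true class. The following pure-Python sketch shows the computation on made-up labels and probabilities for a 9-class problem. `multiclass_log_loss` is an illustrative helper, not scikit-learn's `log_loss`, whose handling of edge cases may differ. Labels here are 0-based for simplicity.

```python
import math

def multiclass_log_loss(y_true, proba, eps=1e-15):
    """Mean negative log-probability assigned to the true class."""
    total = 0.0
    for label, row in zip(y_true, proba):
        # Clip so that log(0) can never occur.
        p = min(max(row[label], eps), 1.0 - eps)
        total += -math.log(p)
    return total / len(y_true)

n_classes = 9
y_true = [0, 3, 8, 5]  # made-up labels, for illustration only

# Baseline: every class gets the same probability 1/9.
uniform = [[1.0 / n_classes] * n_classes for _ in y_true]
baseline = multiclass_log_loss(y_true, uniform)

# A confident, correct model: 0.98 on the true class, the rest spread evenly.
confident = [
    [0.98 if k == label else 0.02 / (n_classes - 1) for k in range(n_classes)]
    for label in y_true
]
good = multiclass_log_loss(y_true, confident)

print("uniform baseline:", round(baseline, 4))
print("confident model:", round(good, 4))
```

A uniform guess over 9 classes scores ln 9, roughly 2.197, while a log loss near 0.02 corresponds to placing about 98% probability on the correct class on average. This gives a scale for reading the train, CV and test values printed by the notebook.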
fields = ['Features', 'Model', 'Hyper-Parameters', 'Train Log Loss', 'CV Log Loss', 'Test Log Loss']
data = [
['Unigram Byte File', 'Random Forest', '-', '2.49', '2.49', '2.44'],
['Unigram Byte File', 'K-Nearest Neighbour', 'k=3', '0.1486', '0.1541', '0.1590'],
['Unigram Byte File', 'Logistic Regression', 'C=100', '0.8552', '0.8755', '0.8560'],
['Unigram Byte File', 'Random Forest', 'n_estimators=500', '0.0450', '0.0458', '0.0432'],
['Unigram Byte File', 'XGBoost', 'n_estimators=2000, learning_rate=0.15, colsample_bytree=0.5, max_depth=5', '0.0214', '0.904', '0.0732'],
['Unigram Asm File', 'K-Nearest Neighbour', 'k=1', '0.0451', '0.0514', '0.0494'],
['Unigram Asm File', 'Logistic Regression', 'C=0.1', '1.0089', '0.9949', '1.0159'],
['Unigram Asm File', 'Random Forest', 'n_estimators=2000', '0.0213', '0.0222', '0.0206'],
['Unigram Asm File', 'XGBoost', 'n_estimators=2000,subsample=0.5,learning_rate=0.05,colsample_bytree=0.3,max_depth=5', '0.0131', '0.0252', '0.0378'],
['Asm Pixel Intensity', 'K-Nearest Neighbour', 'k=5', '0.1640', '0.1941', '0.1982'],
['Asm Pixel Intensity', 'Logistic Regression', 'C=1', '0.6719', '0.6864', '0.6736'],
['Asm Pixel Intensity', 'Random Forest', 'n_estimators=3000', '0.0916', '0.2075', '0.2164' ],
['Asm Pixel Intensity', 'XGBoost', 'n_estimators=1000,subsample=0.3,learning_rate=0.2,colsample_bytree=0.1,max_depth=3', '0.1493', '0.1801', '0.1931'],
['Bigram Byte File', 'K-Nearest Neighbour', 'k=3', '0.4087', '0.7043', '0.6856'],
['Bigram Byte File', 'Logistic Regression', 'C=100', '0.8698', '0.9296', '0.9402'],
['Bigram Byte File', 'Random Forest', 'n_estimators=1000', '0.0550', '0.2962', '0.2929'],
['Bigram Byte File', 'XGBoost', 'n_estimators=1000,subsample=0.5,learning_rate=0.1,colsample_bytree=0.5,max_depth=5', '0.0580', '0.2046', '0.2078'],
['Merged Features', 'Random Forest', 'n_estimators=10', '0.0137', '0.0987', '0.0431'],
['Merged Features', 'XGBoost', 'n_estimators=2000', '0.0100', '0.0353', '0.0226']
]
result = pd.DataFrame(data, columns = fields)
result
| | Features | Model | Hyper-Parameters | Train Log Loss | CV Log Loss | Test Log Loss |
|---|---|---|---|---|---|---|
| 0 | Unigram Byte File | Random Forest | - | 2.49 | 2.49 | 2.44 |
| 1 | Unigram Byte File | K-Nearest Neighbour | k=3 | 0.1486 | 0.1541 | 0.1590 |
| 2 | Unigram Byte File | Logistic Regression | C=100 | 0.8552 | 0.8755 | 0.8560 |
| 3 | Unigram Byte File | Random Forest | n_estimators=500 | 0.0450 | 0.0458 | 0.0432 |
| 4 | Unigram Byte File | XGBoost | n_estimators=2000, learning_rate=0.15, colsamp... | 0.0214 | 0.904 | 0.0732 |
| 5 | Unigram Asm File | K-Nearest Neighbour | k=1 | 0.0451 | 0.0514 | 0.0494 |
| 6 | Unigram Asm File | Logistic Regression | C=0.1 | 1.0089 | 0.9949 | 1.0159 |
| 7 | Unigram Asm File | Random Forest | n_estimators=2000 | 0.0213 | 0.0222 | 0.0206 |
| 8 | Unigram Asm File | XGBoost | n_estimators=2000,subsample=0.5,learning_rate=... | 0.0131 | 0.0252 | 0.0378 |
| 9 | Asm Pixel Intensity | K-Nearest Neighbour | k=5 | 0.1640 | 0.1941 | 0.1982 |
| 10 | Asm Pixel Intensity | Logistic Regression | C=1 | 0.6719 | 0.6864 | 0.6736 |
| 11 | Asm Pixel Intensity | Random Forest | n_estimators=3000 | 0.0916 | 0.2075 | 0.2164 |
| 12 | Asm Pixel Intensity | XGBoost | n_estimators=1000,subsample=0.3,learning_rate=... | 0.1493 | 0.1801 | 0.1931 |
| 13 | Bigram Byte File | K-Nearest Neighbour | k=3 | 0.4087 | 0.7043 | 0.6856 |
| 14 | Bigram Byte File | Logistic Regression | C=100 | 0.8698 | 0.9296 | 0.9402 |
| 15 | Bigram Byte File | Random Forest | n_estimators=1000 | 0.0550 | 0.2962 | 0.2929 |
| 16 | Bigram Byte File | XGBoost | n_estimators=1000,subsample=0.5,learning_rate=... | 0.0580 | 0.2046 | 0.2078 |
| 17 | Merged Features | Random Forest | n_estimators=10 | 0.0137 | 0.0987 | 0.0431 |
| 18 | Merged Features | XGBoost | n_estimators=2000 | 0.0100 | 0.0353 | 0.0226 |
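Best train log loss = 0.0100 (XGBoost on merged features). Best test log loss = 0.0206 (Random Forest on unigram asm features), with XGBoost on merged features close behind at 0.0226.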
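The ranking can be checked directly from the table. The sketch below copies the numeric columns from the results above into plain Python tuples and sorts them by test log loss. The variable names (`rows`, `by_test`, `best_test`, `best_train`) are illustrative and not part of the notebook.

```python
# (features, model, train, cv, test) copied from the results table above.
rows = [
    ('Unigram Byte File', 'Random Forest', 2.49, 2.49, 2.44),
    ('Unigram Byte File', 'K-Nearest Neighbour', 0.1486, 0.1541, 0.1590),
    ('Unigram Byte File', 'Logistic Regression', 0.8552, 0.8755, 0.8560),
    ('Unigram Byte File', 'Random Forest', 0.0450, 0.0458, 0.0432),
    ('Unigram Byte File', 'XGBoost', 0.0214, 0.904, 0.0732),
    ('Unigram Asm File', 'K-Nearest Neighbour', 0.0451, 0.0514, 0.0494),
    ('Unigram Asm File', 'Logistic Regression', 1.0089, 0.9949, 1.0159),
    ('Unigram Asm File', 'Random Forest', 0.0213, 0.0222, 0.0206),
    ('Unigram Asm File', 'XGBoost', 0.0131, 0.0252, 0.0378),
    ('Asm Pixel Intensity', 'K-Nearest Neighbour', 0.1640, 0.1941, 0.1982),
    ('Asm Pixel Intensity', 'Logistic Regression', 0.6719, 0.6864, 0.6736),
    ('Asm Pixel Intensity', 'Random Forest', 0.0916, 0.2075, 0.2164),
    ('Asm Pixel Intensity', 'XGBoost', 0.1493, 0.1801, 0.1931),
    ('Bigram Byte File', 'K-Nearest Neighbour', 0.4087, 0.7043, 0.6856),
    ('Bigram Byte File', 'Logistic Regression', 0.8698, 0.9296, 0.9402),
    ('Bigram Byte File', 'Random Forest', 0.0550, 0.2962, 0.2929),
    ('Bigram Byte File', 'XGBoost', 0.0580, 0.2046, 0.2078),
    ('Merged Features', 'Random Forest', 0.0137, 0.0987, 0.0431),
    ('Merged Features', 'XGBoost', 0.0100, 0.0353, 0.0226),
]

# Rank all models by test log loss (lower is better).
by_test = sorted(rows, key=lambda r: r[4])
best_test = by_test[0]
best_train = min(rows, key=lambda r: r[2])

print("Top 3 by test log loss:")
for features, model, train, cv, test in by_test[:3]:
    print(f"  {features:20s} {model:15s} test={test:.4f}")
print("Best train log loss:", best_train[0], best_train[1], best_train[2])
```

Sorting by test rather than train log loss matters here because several rows, such as Random Forest on bigram byte features, fit the training data well but score noticeably worse on the CV and test splits.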